5 Advanced Topics
When we want to use MD simulations in research, efficiency is often of great importance. Several ideas
for improving the efficiency of your MD codes are given in this chapter.
5.1 Bottleneck is the FORCE
As already explained, typical MD programs are composed of several elements:
Initialize
Force
Move
Data accumulation
Finalize
Figure 5-25: Typical structure of an MD simulation program.
Two components, FORCE and MOVE, are essential. Roughly speaking,33) the FORCE part in its original form requires computational time proportional to N² (N: number of particles), while the MOVE part is proportional to N. Thus, as the system size increases, the efficiency of the FORCE part determines the simulation speed.
There are several free tools to measure “lap time” of each component, as shown in Fig. 5-26 for example.
Figure 5-26: Example of performance analysis for an MD code; taken from R. Nabika, Master Thesis,
Kyoto Univ. (2011).
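If no such tool is at hand, a rough lap-time measurement can also be obtained by wrapping each component with clock() calls. The following is a minimal sketch we add here for illustration; the empty force( ) and move( ) stubs stand in for the actual routines of your MD code.

#include <stdio.h>
#include <time.h>

/* Stubs standing in for the MD routines; replace them with the real force( ) and move( ). */
static void force( ) { }
static void move( )  { }

int main( )
{
    clock_t t0, t1, t2;
    double t_force = 0.0, t_move = 0.0;   /* accumulated lap times in seconds */
    int step;

    for (step = 0; step < 1000; step++) {
        t0 = clock( );
        force( );                          /* pair interactions, roughly O(N^2) */
        t1 = clock( );
        move( );                           /* time integration, O(N) */
        t2 = clock( );
        t_force += (double)(t1 - t0) / CLOCKS_PER_SEC;
        t_move  += (double)(t2 - t1) / CLOCKS_PER_SEC;
    }
    printf("lap time  FORCE: %10.5f sec   MOVE: %10.5f sec\n", t_force, t_move);
    return 0;
}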
33) In this estimate of the computational time, we assume a two-body interaction. If you use a many-body model, the time increases much more steeply, of course.
5.2 From physical viewpoints
5.2.1 Potential cut-off
Here is a simple idea based on a physical property of the particle interaction, namely the potential cut-off (or truncation). The interaction potential φ between particles normally converges to zero rapidly enough as the distance r becomes large, so we can truncate the potential at some distance rc.34) For LJ systems, rc = 3σ–5σ is often adopted. A pseudo-code is shown in Fig. 5-27. Note, however, that the truncation itself does not save much time, because the full double loop still remains, in which the distance is calculated for every pair of particles. Two algorithms are often adopted for further improvement.
for (i=1;i<NATOM;i++) {
  for (j=0;j<i;j++) {
    dx=posx[i]-posx[j];
    dy=posy[i]-posy[j];
    dz=posz[i]-posz[j];
    r2= dx*dx +dy*dy +dz*dz;
    if (r2 < RCUT*RCUT) {
      calculation of force and potential energy
    }
  }
}

[Plot: the LJ potential u(r) [ε] and force f(r) [ε/σ] as functions of r [σ].]

  rc/σ      φ(rc)/ε
  2.5    -0.016316891136
  3.0    -0.005479441744
  3.5    -0.002174780392
  4.0    -0.000976324081
  5.0    -0.000255983616
  6.0    -0.000085732044
  7.0    -0.000033999150

Figure 5-27: Pseudo-code for potential truncation, and the truncation error [the value φ(rc)] for the LJ potential at r = rc.
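The values of φ(rc) in Fig. 5-27 are easy to check yourself. The following small program (our own illustration, not part of the lecture code) evaluates the LJ potential φ(r) = 4ε[(σ/r)¹² − (σ/r)⁶] at the listed cut-off distances in reduced units (ε = σ = 1):

#include <stdio.h>
#include <math.h>

/* LJ potential in reduced units (epsilon = sigma = 1) */
static double lj_potential(double r)
{
    double ri6 = 1.0 / pow(r, 6.0);
    return 4.0 * (ri6 * ri6 - ri6);
}

int main( )
{
    double rc[] = {2.5, 3.0, 3.5, 4.0, 5.0, 6.0, 7.0};
    int i, n = sizeof(rc) / sizeof(rc[0]);

    printf("  rc/sigma     phi(rc)/epsilon\n");
    for (i = 0; i < n; i++)
        printf("  %5.1f   %18.12f\n", rc[i], lj_potential(rc[i]));
    return 0;
}

For rc = 2.5σ this gives φ(rc) ≈ −0.0163ε, i.e., the truncated tail is already below 2% of the well depth.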
5.2.2 Book-keeping method
For a given particle i, the number of surrounding particles j that can possibly interact with i is fairly small in usual situations; for liquids and solids under normal conditions, the typical number is 50–200. In the book-keeping algorithm (帳簿法), we make a table of such j for each i and use the table in FORCE. Of course, we have to update the table once in a while, because particles can move
34) It is well known that a simple truncation for the Coulombic interaction does not work, because integration of 1/r over the infinite 3D space diverges:

  ∫_{rc}^{∞} (1/r) 4πr² dr → +∞    (5-41)

In that case, special treatments such as the Ewald summation method are required. Also the particle-cell algorithm and the reaction field method are often used. See textbooks for details.
for (i=1;i<NATOM;i++) {
for (jt=0;jt<itmax[i];jt++) {
j=itable[i][jt];
dx=posx[i]-posx[j];
dy=posy[i]-posy[j];
dz=posz[i]-posz[j];
r2= dx*dx +dy*dy +dz*dz;
if (r2 < RCUT*RCUT) {
calculation of force and potential energy
}
} }
Figure 5-28: Concept and a pseudo-code of the book-keeping method (rcut: interaction cut-off radius; rtbl: table radius).
and get out of or come into the interaction range. The update does take some additional time, but the calculation load in FORCE is drastically reduced in general; with the book-keeping algorithm, the calculation time is expected to be proportional to N.
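The table construction itself might look as follows; this is only a rough sketch added here for illustration. The array names itable and itmax follow Fig. 5-28, RTBL plays the role of the table radius rtbl (slightly larger than the cut-off), and the remaining names, sizes, and the update criterion are our own assumptions.

#define NATOM    1000
#define MAXNEIGH 300      /* assumed upper bound of table entries per particle */

double posx[NATOM], posy[NATOM], posz[NATOM];
int    itable[NATOM][MAXNEIGH];   /* neighbor table of particle i (stores j < i only) */
int    itmax[NATOM];              /* number of entries for particle i */
double RTBL = 4.5;                /* table radius, slightly larger than RCUT */

/* Rebuild the neighbor table; call this every ~10-20 steps, or whenever some
   particle has moved farther than (RTBL - RCUT)/2 since the last update. */
void update_table( )
{
    int i, j;
    double dx, dy, dz, r2;

    for (i = 0; i < NATOM; i++) {
        itmax[i] = 0;
        for (j = 0; j < i; j++) {          /* only j < i, so each pair appears once */
            dx = posx[i] - posx[j];
            dy = posy[i] - posy[j];
            dz = posz[i] - posz[j];
            r2 = dx*dx + dy*dy + dz*dz;
            if (r2 < RTBL*RTBL)
                itable[i][itmax[i]++] = j;
        }
    }
}

Note that this straightforward construction is still O(N²); for large systems the table is usually built with the cell-division method described next, which brings the total cost down to nearly O(N).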
5.2.3 Cell-division method
For a very large system, the cut-off distance rc can be much smaller than the cell size L. In this situation, the calculations in FORCE are very much “localized”, i.e., particles in distant zones of the cell do not interact with each other at all. Thus, we can divide the simulation cell into small “sub-cells” of size ∼ rc, and the interactions are summed up only within each sub-cell and between it and its nearest-neighbor sub-cells. This is called the cell-division algorithm (セル分割法). We need the cell-division procedure every step, but the load in FORCE decreases, becoming almost ∝ N.
for (icell=0;icell<NUMCELL;icell++) {
  for (it=0;it<ntmax[icell];it++) {
    i=ntable[icell][it];
    // In the same subcell
    for (jt=0;jt<it;jt++) {
      j=ntable[icell][jt];
      dx=posx[i]-posx[j];
      dy=posy[i]-posy[j];
      dz=posz[i]-posz[j];
      r2= dx*dx +dy*dy +dz*dz;
      if (r2 < RCUT*RCUT) {
        calculation of force and potential energy
      }
    }
    // between the neighbor subcells (13 in the 3D case)
    for (jc=0;jc<13;jc++) {
      jcell=neighborcell[icell][jc];
      for (jt=0;jt<ntmax[jcell];jt++) {
        j=ntable[jcell][jt];
        dx=posx[i]-posx[j];
        dy=posy[i]-posy[j];
        dz=posz[i]-posz[j];
        r2= dx*dx +dy*dy +dz*dz;
        if (r2 < RCUT*RCUT) {
          calculation of force and potential energy
        }
      }
    }
  }
}
Figure 5-29: Concept and a pseudo-code of the cell-division method (sub-cell size ∼ rcut).
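Before the loops of Fig. 5-29 can be executed, every particle has to be registered in its sub-cell once per step. A minimal sketch of this registration is given below; ntable and ntmax follow the pseudo-code above, while the other names, the fixed maximum occupancy, and the assumption that positions lie inside [0, CELL) are our own.

#define NATOM      8000
#define NCELLX     10                 /* sub-cells per cell edge (roughly CELL_X/RCUT) */
#define NUMCELL    (NCELLX*NCELLX*NCELLX)
#define MAXINCELL  64                 /* assumed maximum number of particles per sub-cell */

double posx[NATOM], posy[NATOM], posz[NATOM];
double CELL_X, CELL_Y, CELL_Z;        /* simulation cell size */
int    ntable[NUMCELL][MAXINCELL];    /* particles belonging to each sub-cell */
int    ntmax[NUMCELL];                /* occupancy of each sub-cell */

/* Register every particle in its sub-cell; called every step before FORCE.
   Positions are assumed to be already wrapped into [0, CELL). */
void make_cell_table( )
{
    int i, ix, iy, iz, icell;

    for (icell = 0; icell < NUMCELL; icell++) ntmax[icell] = 0;
    for (i = 0; i < NATOM; i++) {
        ix = (int)(posx[i] / CELL_X * NCELLX);   /* 0 <= ix < NCELLX */
        iy = (int)(posy[i] / CELL_Y * NCELLX);
        iz = (int)(posz[i] / CELL_Z * NCELLX);
        icell = (ix * NCELLX + iy) * NCELLX + iz;
        ntable[icell][ntmax[icell]++] = i;
    }
}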
[Plot: computation time (seconds per 100 steps, data file 'speed.dat') versus number of atoms, up to 150000.]
Figure 5-30: Examples of the simulation time with the cell-division method; a LJ system with cut-off
distance rc = 3.5 σ, number density ρ = 1.0. Note that the time is almost proportional to N .
An example is shown in Fig. 5-30: the required time per 100 steps on an Intel Core2Quad (2.66 GHz) is indeed proportional to the number of particles, as expected. Of course, the actual time strongly depends on the simulation conditions, especially the number density (i.e., the number of neighboring particles) and the cut-off distance.
5.3 From mathematical viewpoints
A larger ∆t is preferable for efficient calculations, of course. Various techniques have been proposed to carry out stable simulations with sufficient accuracy.
5.3.1 Systematic time integration based on symplectic transformation
The equations of motion for a given Hamiltonian H(q, p) are

  dq/dt = ∂H/∂p,    dp/dt = −∂H/∂q    (5-42)

In a vector formulation,

  d/dt (q, p) = [ (∂H/∂p)(∂/∂q) − (∂H/∂q)(∂/∂p) ] (q, p) ≡ [Lk + Lu] (q, p)    (5-43)

Here Lk and Lu are the operators35)36) in the phase space (位相空間), defined as

  Lk ≡ (∂H/∂p)(∂/∂q)    (5-44)
  Lu ≡ −(∂H/∂q)(∂/∂p)    (5-45)
Note that these are non-commutative (非可換), i.e.,

  Lk Lu ≠ Lu Lk    (5-46)

A formal solution of Eq. (5-43) is then

  ( q(∆t), p(∆t) ) = exp[∆t (Lk + Lu)] ( q(0), p(0) )    (5-47)
Expanding the exponential operator on the right-hand side of Eq. (5-47) in terms of ∆t and truncating it at some appropriate order, we obtain a series of symplectic forms37) of numerical integration. For that purpose, note that the exponential of the phase-space operators is, by definition,

  exp[∆t (Lk + Lu)] = 1 + ∆t (Lk + Lu) + (1/2!)(∆t)² (Lk + Lu)(Lk + Lu)
                        + (1/3!)(∆t)³ (Lk + Lu)(Lk + Lu)(Lk + Lu) + ···    (5-48)
35) L is sometimes called a “Liouville operator” (リウヴィユ演算子); refer to textbooks on analytical mechanics.
36) Joseph Liouville (1809–1882), French physicist and mathematician.
37) The word “symplectic” was used by Hermann Weyl (1885–1955, German mathematician and theoretical physicist) with the meaning “complex”: “complex” comes from a Latin word, while “symplectic” comes from the corresponding ancient Greek one. Today “symplectic” is used mainly in mathematics, as in symplectic manifold and symplectic geometry. (Wikipedia)
1. First Order Symplectic Form

When the expansion is truncated at first order in ∆t, we have

  exp[∆t (Lk + Lu)] = exp[∆t Lu] exp[∆t Lk] + O(∆t²)    (5-49)

which is the first-order symplectic form. The right-hand side is a product of simple integrators, thus

  ( q(∆t), p(∆t) ) = exp[∆t Lu] exp[∆t Lk] ( q(0), p(0) )
                   ≃ exp[∆t Lu] ( q(0) + ∆t p(0)/m, p(0) )
                   = ( q(0) + ∆t p(0)/m, p(0) + ∆t F(∆t) )    (5-50)

This is essentially equivalent to the naïve difference form (or the Euler formula),

  q(∆t) = q(0) + ∆t p(0)/m
  p(∆t) = p(0) + ∆t F(0)    (5-51)
2. Second Order Symplectic Form

The next order has some arbitrariness, but the often-used form is

  exp[∆t (Lk + Lu)] = exp[(∆t/2) Lu] exp[∆t Lk] exp[(∆t/2) Lu] + O(∆t³)    (5-52)

Thus we obtain

  ( q(∆t), p(∆t) ) = exp[(∆t/2) Lu] exp[∆t Lk] exp[(∆t/2) Lu] ( q(0), p(0) )
                   ≃ exp[(∆t/2) Lu] exp[∆t Lk] ( q(0), p(0) + (∆t/2) F(0) )
                   = exp[(∆t/2) Lu] ( q(0) + ∆t [p(0) + (∆t/2) F(0)]/m, p(0) + (∆t/2) F(0) )
                   = ( q(0) + ∆t [p(0) + (∆t/2) F(0)]/m, p(0) + (∆t/2) F(0) + (∆t/2) F(∆t) )    (5-53)

This is mathematically equivalent to the Verlet algorithm and also to the leap-frog one.
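Written out per particle, the last line of Eq. (5-53) is nothing but the velocity-Verlet update: a half kick with F(0), a free drift, and a half kick with F(∆t). A minimal one-particle, one-dimensional sketch (added here for illustration, with a harmonic force as a stand-in for the MD force routine) is:

#include <stdio.h>

/* One velocity-Verlet step for a single 1D particle, following Eq. (5-53). */
static double harmonic_force(double q) { return -q; }   /* F = -kq with k = 1 */

static void verlet_step(double *q, double *p, double m, double dt)
{
    *p += 0.5 * dt * harmonic_force(*q);   /* exp[(dt/2) Lu] : half kick with F(0)  */
    *q += dt * (*p) / m;                   /* exp[ dt    Lk] : drift                */
    *p += 0.5 * dt * harmonic_force(*q);   /* exp[(dt/2) Lu] : half kick with F(dt) */
}

int main( )
{
    double q = 1.0, p = 0.0, m = 1.0, dt = 0.01;
    int step;
    for (step = 0; step < 1000; step++) verlet_step(&q, &p, m, dt);
    /* the energy p*p/(2m) + q*q/2 stays close to its initial value 0.5 */
    printf("q = %f, p = %f, E = %f\n", q, p, p*p/(2*m) + 0.5*q*q);
    return 0;
}

Because each sub-step is the exact flow of one piece of the split Hamiltonian, the energy error stays bounded instead of drifting, which is, roughly speaking, why this form remains stable for relatively large ∆t.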
3. Higher Order Symplectic Forms

See textbooks.
The symplectic forms (of second order or higher) are known to be very stable even for larger ∆t, as shown in Fig. 5-31. This kind of acceleration technique is widely used in recent multi-scale or multi-physics simulations; you should study it more closely if you plan to do such simulations.
Figure 5-31: Comparison among algorithms for numerical integration. [Source] 上田顕 (A. Ueda), 分子シミュレーション (Molecular Simulation), 裳華房 (Shokabo), 2003; the reference [9.14] cited there is 岡部恒康, 物性研究 74-1 (2004).
5.3.2 Multi time-step MD
If your target molecules contain several kinds of degrees of freedom with a wide range of “characteristic times”, chances are that you can separate them and apply a different time step ∆t to each degree of freedom. For example, a water molecule has six degrees of freedom as a rigid rotor, as shown in Fig. 5-32. Under normal liquid conditions, the characteristic times are

  τ ∼ 20 fs : translational motions (1 fs = 10⁻¹⁵ s)
  τ ∼ 10 fs : rotational motions around the y- and z-axes
  τ ∼ 2 fs  : rotational motions around the x-axis
Figure 5-32: A model of a water molecule (the molecular x- and y-axes are indicated in the figure). As a rigid rotor, it has six degrees of freedom: three for translations and three for rotations.
For standard MD simulations, a short time step of ∆t ≃ 0.5 fs is often used in order to trace the fastest rotational motion with some accuracy. The multi time-step method allows you to use a longer ∆t for the slow motions; for example, you may use 5 fs for the translational motions, which means that ten updates of the rotational motions are performed during a single update of the translational motions.
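A rough skeleton of such a multi time-step loop is sketched below. This is our own illustration, not the author's scheme: all routine names are placeholders (implemented here as empty stubs), and NSUB = 10 corresponds to the 5 fs / 0.5 fs example above.

/* Skeleton of a multi time-step (RESPA-like) step: the slow translational forces
   are evaluated with the long step DT, the fast rotational degrees of freedom
   with the short step dt = DT/NSUB.  All routines are empty stubs. */
#define NSUB 10                               /* e.g. DT = 5 fs, dt = 0.5 fs */

static void compute_slow_forces( )        { }
static void compute_fast_torques( )       { }
static void kick_translations(double h)   { (void)h; }
static void update_rotations(double h)    { (void)h; }
static void drift_translations(double h)  { (void)h; }

static void md_step_multi(double DT)
{
    double dt = DT / NSUB;
    int k;
    compute_slow_forces( );
    kick_translations(0.5 * DT);              /* half kick of the slow momenta   */
    for (k = 0; k < NSUB; k++) {              /* inner loop for the fast motions */
        compute_fast_torques( );
        update_rotations(dt);                 /* rotations advance with dt       */
        drift_translations(dt);               /* positions also drift with dt    */
    }
    compute_slow_forces( );
    kick_translations(0.5 * DT);              /* second half kick                */
}

int main( ) { md_step_multi(0.005); return 0; }   /* one step with DT = 5 fs (in ps) */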
5.4 From computer's viewpoints
Of course, using a “faster computer” is a quick solution to the speed problem. There are several points to be considered, however:
1. High-speed hardware (e.g., vector processors, high-speed memory, networks with broad bandwidth) is very expensive in general.
2. The increase of single-CPU speed may slow down in the future; see Fig. 5-33.
3. CPU architectures are becoming complicated, e.g., dual-, quad-, and multi-core.
Among these topics, an introduction to “parallel computing (並列計算)” is given in the following. This was quite a challenging field a decade ago. Owing to the recent remarkable progress of hardware and software, however, molecular simulation in parallel computing environments has become more and more popular.
Moore's Law38) (Wikipedia)
Figure 5-33: The well-known Moore's law states that the number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years. The speed of a single CPU is expected to be roughly proportional to this number.
38) (From Wikipedia) For the thirty-fifth anniversary issue of Electronics magazine, which was published on April 19,
1965, Gordon E. Moore, who was working as the director of research and development (R&D) at Fairchild Semiconductor
at the time, was asked to predict what was going to happen in the semiconductor components industry over the next
ten years. His response was a brief article entitled, “Cramming more components onto integrated circuits.” Within his
editorial, he speculated that by 1975 it would be possible to contain as many as 65,000 components on a single quarter-inch
semiconductor.
The complexity for minimum component costs has increased at a rate of roughly a factor of two per year.
Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term,
the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly
constant for at least 10 years.
5.5 Parallel computing 1: Hardware side
There are two types of parallel computers, the distributed memory type (分散メモリ型) and the shared memory type (共有メモリ型), as shown schematically in Fig. 5-34.
Parallel computer systems with distributed memory have a long history and usually show good cost performance, but you have to take care of the data flow yourself. On the other hand, shared memory systems normally need special hardware for high-speed data transfer and have been relatively expensive. Recently, however, multicore processors have become popular, and you can personally try parallel computing on a small-scale shared memory system.
Figure 5-34: Schematic view of (Left) a distributed memory system and (Right) a shared memory system.
5.5.1 Examples of distributed memory systems
PC clusters are the best-known example of distributed memory systems. Most recently developed supercomputers are also based on the distributed memory architecture. All supercomputers listed in Table 5-1 have a large number of cores/nodes. The essential point (or bottleneck) is the speed of the network. Ethernet is the most common one, and its speed has become as high as 10 Gbps (gigabits per second), but its performance is still insufficient. For PC clusters, Myrinet39) and InfiniBand40) are often adopted.
39) (From the Japanese Wikipedia) Myrinet is a network system made by Myricom. The latest Myri-10G Myrinet has a throughput of 20 Gbps in full-duplex operation, and thanks to its low communication latency and low error rate, Myrinet is widely used as a network device for cluster computing.
40) (From the Japanese Wikipedia) InfiniBand is a high-speed I/O bus architecture and interconnect for mission-critical and HPC servers/clusters, with very high RAS (reliability, availability, serviceability). A single InfiniBand lane has a transfer rate of up to 2.5 Gbps in each direction; higher rates are obtained by bundling 4 (4X) or 12 (12X) lanes. With 12X at QDR (Quad Data Rate), the rate reaches 120 Gbps (raw), or 96 Gbps of data transfer.
41) (From Wikipedia) LINPACK is a software library for numerical linear algebra, developed as a FORTRAN library at the Argonne National Laboratory in the U.S. It was designed for the supercomputers of the 1970s and early 1980s and was later superseded by the more refined library LAPACK. The LINPACK benchmark is a benchmark program based on LINPACK that evaluates the floating-point performance of a system. Devised by Jack Dongarra (a U.S. computer scientist), it measures the speed of solving an n-dimensional system of linear equations A x = b, a task common in science and engineering. The latest version of this benchmark is used in the TOP500 ranking of the world's fastest computers. Representing every aspect of a computer system's performance with a single number is impossible, and the benchmark is only one simplification: what it measures is the number of 64-bit floating-point operations (usually additions and multiplications) executed per second (FLOPS). The performance of real applications is likely to be much lower than the peak performance obtained with a well-tuned LINPACK benchmark.
Table 5-1: From a recent Top500 list (November 2014). For details, see http://www.top500.org/. A European organization has provided a ranking of the world's supercomputers twice a year, based on the LINPACK Benchmark.41)

[The original table lists the top 50 systems, with columns for rank, name, computer (system and processor), site, country, year, total cores, Rmax, power, power efficiency (Mflops/W), architecture (Cluster or MPP), processor, and accelerator/co-processor. The first five entries are reproduced below.]

  Rank  Name                    Site, Country                                               Year  Total Cores  Rmax [Gflops]
  1     Tianhe-2 (MilkyWay-2)   National Super Computer Center, China                       2013  3,120,000    33,862,700
  2     Titan                   DOE/SC/Oak Ridge National Laboratory, United States         2012    560,640    17,590,000
  3     Sequoia                 DOE/NNSA/LLNL, United States                                2011  1,572,864    17,173,224
  4     K computer              RIKEN Advanced Institute for Computational Science, Japan   2011    705,024    10,510,000
  5     Mira                    DOE/SC/Argonne National Laboratory, United States           2012    786,432     8,586,612
Note: FLOPS stands for FLoating-point Operations Per Second (浮動小数点演算回数) and is used as an index of computing speed. It is reported that the recent Core 2 Quad achieves about 50 GFLOPS, and that the PS3 (Cell) reaches 2 TFLOPS as a whole system. Developing PFLOPS (= 10¹⁵ FLOPS) computers has become a popular project worldwide.
5.5.2 Examples of shared memory systems
The operation speed of a single CPU essentially depends on the clock frequency (or clock rate42)), which is determined by how fast the electronic state can change in each element. As the clock frequency increases, the energy dissipation (and heat emission) is no longer negligible, and the speed of a single chip seems to have saturated recently. Thus the idea of the multi-core CPU has become popular, where several cores share a single in-processor memory.
42) (Wikipedia) Historical milestones and current records of Clock Rate: The first general purpose computer, the ENIAC,
used a 100 kHz clock in its cycling unit. As each instruction took 20 cycles, it had an instruction rate of 5kHz. The first
commercial PC, the Altair 8800 (by MITS), used an Intel 8080 CPU with a clock rate of 2 MHz. The original IBM PC (c.
1981) had a clock rate of 4.77 MHz. In 1992, both Hewlett-Packard and Digital Equipment Corporation broke the difficult
100 MHz limit with RISC techniques in the PA-7100 and AXP 21064 DEC Alpha respectively. In 1995, Intel’s Pentium
chip ran at 100 MHz. On March 6, 2000, AMD reached the 1 GHz milestone a few months ahead of Intel. In 2002, an Intel
Pentium 4 model was introduced as the first CPU with a clock rate of 3 GHz. Since then, the clock rate of production
processors has increased much more slowly, with performance improvements coming from other design changes. As
of 2011, the Guinness Record for fastest CPU is by AMD with a Bulldozer based FX chip “overclocked” to 8.308 GHz;
however, it has now been superseded by the next generation of AMD’s FX chips “Piledriver” with a clock rate of 8.429
GHz. As of mid-2013, the highest clock rate on a production processor is the IBM zEC12, clocked at 5.5 GHz, which was
released in August of 2012.
Figure 5-35: Concept of the Next-Generation Supercomputer.
Data are taken from http://www.nsc.riken.jp/project/spec.html
(From the Japanese Wikipedia)
Multi-core is a technique in which several processor cores are packed into a single processor package; it is one form of multiprocessing. Although such a chip looks like a single processor from the outside, it is recognized internally as multiple processors, so the technique is used mainly to raise the total processing power of the processor chip in environments where parallel processing is performed. A package with two processor cores is called dual-core, with four quad-core, with six hexa-core, and with eight octal-core (octa-core). Some high-performance special-purpose processors carry several tens of cores and are called many-core. In contrast to multi-core, a conventional processor with one core is called single-core.

[Figure: conceptual diagram of a single-die multi-core processor. In this example there are two processor cores, each with its own level-1 cache, and a level-2 cache shared by the two cores.]

(Examples)
• Intel Xeon: 4 or 8 cores (twice as many threads)
• Intel Core i5, i7: 4 or 8
• AMD Opteron: 2, 4, 6, 8, or 12 cores
In general, the shared-memory structure requires specially designed hardware, and large-scale shared-memory systems can be very expensive. From the programming viewpoint, however, all nodes have the same data (or can access the same data easily), and parallel codes are developed more easily than for distributed systems.
5.5.3 Homogeneous vs. heterogeneous

Various multi-core architectures exist; “homogeneous” processors consist of several cores of the same architecture, while “heterogeneous” ones contain different types of cores. Here are three examples:
1. Cell processor, developed by IBM, Sony, and Toshiba.
2. Graphics processing units (GPUs), e.g., GeForce by NVIDIA.
3. Xeon Phi, by Intel.
(From the Japanese Wikipedia) The Cell is classified as a multi-core processor and has nine cores in a single microprocessor: it is a heterogeneous multi-core combining one general-purpose processor core with eight simple processor cores. The general-purpose core, called the PowerPC Processor Element (PPE), is responsible for control, while the eight cores, called Synergistic Processor Elements (SPEs), are responsible for computation. Unlike conventional PC microprocessors such as the PowerPC G5, Pentium 4, or Athlon 64, which have advanced out-of-order execution and branch prediction, neither the PPE nor the SPE carries complex scheduling mechanisms such as instruction reordering; the cores are kept simple, which allows high clock rates. As a result, the integer performance for complex conditional branching is inferior to that of PC microprocessors, but the Cell has powerful, highly parallelized arithmetic capability, and it shows its real performance in numerical analysis, simulation, and video and audio processing by operating its many cores in parallel.
Cell architecture, taken from http://www.ibm.com/developerworks/power/cell/documents.html
(Excerpted from Wikipedia)

GPGPU

GPGPU (general-purpose computing on graphics processing units) is the technique of applying the computational resources of a GPU to purposes other than image processing.

Overview
A GPU is an arithmetic device generally specialized in image processing; in most cases it is an auxiliary IC dedicated to generating video signals under the control of the main processor (the CPU). Real-time generation of moving images demands high computational power, but most of it consists of repetitions of simple, well-formulated operations and is therefore well suited to hardware implementation; several semiconductor makers manufacture and sell series of ICs with fast memory interfaces and high graphics performance. In particular, 3D rendering performance has improved dramatically since the mid-1990s, and GPUs have increasingly taken on the character of SIMD machines centered on matrix operations. The concept of GPGPU is to use this capability not only for graphics rendering but also for other numerical computations. The floating-point performance of GPUs exceeded 100 GFLOPS around 2005, whereas that of CPUs has leveled off at a few tens of GFLOPS as of 2007. Because of their simple structure, GPUs are efficient at floating-point operations, and since they have a wide memory bandwidth to locally attached dedicated memory ICs, they are inexpensive relative to their performance and their performance grows faster than that of CPUs. Since the appearance of DirectX 9.0, NVIDIA's integrated development environment for GPGPU, CUDA, and AMD's ATI Stream have appeared, and the range of GPGPU applications is widening. NVIDIA has also launched Tesla, an HPC-oriented GPGPU based on its G80 GPU, aiming to replace conventional vector computers.

Characteristics and issues
Conditional branching: A GPU has a large number of arithmetic units called shader processors, grouped into clusters. Each cluster has only one instruction unit; the shader processors in a cluster are given different data and execute the same instruction on them at once. Such SIMD (single instruction, multiple data) processing is effective for 3D and multimedia workloads, but when a conditional branch appears in the instructions, the overhead increases and the efficiency drops sharply. In principle, optimizing for parallel data processing with programs of simple algorithmic structure is the key to drawing out the full advantage of GPGPU.

Floating-point arithmetic and local memory: Since the NVIDIA Tesla 20 series released in 2010, double precision can be executed at half the speed of single precision, but GPUs are generally weak at double-precision floating-point arithmetic, because most graphics computations can be done with integer or single-precision arithmetic. A GPU is basically specialized in generating 2D video frames in real time by sequentially processing simple array-structured data with single-precision floating-point operations, so not much performance can be expected for other uses. With a repurposed graphics-only IC, scientific computations that require double precision or better, or that have low locality, do not reach very high performance. ICs newly developed for GPGPU, derived from GPUs but not dedicated to graphics, support double-precision arithmetic and wider memory spaces, and they are expected to be used for a broad range of scientific computing.

Suitable fields: Applications expected to benefit from GPGPU include
• high-performance clusters (HPC)
• physics simulation
• database processing
• Fourier transforms
• the lattice Boltzmann method
• audio processing
• cryptanalysis
• digital image processing
• market analysis
• video transcoding
• astronomical simulation
• ray tracing
• machine learning
• molecular dynamics
• optimization problems
• fluid and climate simulation
• constructive solid geometry

(Example) NVIDIA GeForce GTX 590

The programming model is still rather complicated, and knowledge of GPUs or parallel processing is required to extract the performance. With the advent of OpenCL, an environment in which programmers can exploit the computing power of multi-core CPUs and GPGPUs seamlessly is being established, and wide adoption of GPGPU-ready libraries is expected.
(From the Japanese Wikipedia) Xeon Phi is the brand name of the MIC (Many Integrated Core) architecture derived from Larrabee (an Intel internal code name); it is a computation board for parallel computing equipped with x86-compatible coprocessors. The first generation of products targets the HPC field, but in the future it is also intended for corporate data centers and workstations. The biggest selling point of the Xeon Phi coprocessor is that applications for the conventional IA architecture can be used as they are. It can also run a Linux-based OS independent of the host OS. In addition, it supports 512-bit SIMD instructions, improving performance by processing multiple data with a single instruction.
The product is supplied as a PCI Express expansion card; it is a pure computation device with no rasterizer, video playback engine, or display output. Its competitor is NVIDIA's GPGPU, Tesla. The strength of the Xeon Phi is that its cores are x86, so programs written for x86 CPUs can be used almost as they are. Although it is connected via PCI Express, a standalone Linux runs inside the board, and you can log in to it from the host via SSH; this is impossible on a GPGPU, which cannot run an independent OS.

(From the web page of a computer vendor: http://www.hpc.co.jp/xeon phi product.html)
Background of the product: High-performance computers have conventionally been built as clusters of servers, each carrying two or four CPUs with 6–8 cores. This configuration can be procured relatively cheaply as a whole, but it tends to require a large installation space and large power consumption. Highly integrated parallel computing devices have therefore attracted attention in recent years, led by NVIDIA's HPC-oriented Tesla graphics cards and their programming environment CUDA. In 2012 Intel announced the Phi series of HPC coprocessors. The Phi consists of processors based on the x86 architecture, like the conventional Xeon CPUs, so development can be done using the programming languages, parallel models, and development tools written for x86 as they are.

Specifications of the Phi 5110P:
  Number of cores: 60
  Core frequency: 1053 MHz
  Single-precision floating-point performance: 2.02 TFLOPS
  Double-precision floating-point performance: 1.01 TFLOPS
5.6 Parallel computing 2: Software side
Most of the recent OSs (Linux, Windows, etc.) support parallel coding. For ordinary users, it is convenient to utilize some “interface” software (sometimes called “tools”, “middleware”, or “libraries”), such as
• pthread: the POSIX-standard thread library. Windows versions also exist.
• PVM: Parallel Virtual Machine. It works on various OSs.
• MPI: Message Passing Interface. It also works on various OSs. It is library-based and works independently of the programming language. A free implementation, MPICH, is very popular. (A minimal example is sketched below.)
• OpenMP: mainly developed for shared-memory computers. Parallelization is realized by inserting directives (指示行), thus the same code can be used in both parallel and non-parallel environments.
• CUDA, OpenCL: CUDA (Compute Unified Device Architecture) is a development environment for GPUs provided by NVIDIA. OpenCL (Open Computing Language) is a framework for parallel computing, especially for multicore CPUs, GPUs, and Cell processors. OpenCL 1.2 was the most recent version (Nov. 15, 2011). NVIDIA recently announced that it would adopt OpenCL for CUDA.
Processing flow of CUDA (from Wikipedia):
1. Copy the data from the main memory to the GPU memory.
2. The CPU instructs the GPU to start processing.
3. The GPU loads the required data and executes it in parallel on its cores.
4. Copy the results from the GPU memory back to the main memory.
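For comparison with the OpenMP examples in § 5.6.1, a minimal MPI counterpart of the “hello” test is sketched here (our own illustration; it assumes an MPI implementation such as MPICH is installed, compiled with mpicc and run with, e.g., mpirun -np 4):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                   /* start the MPI environment     */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* my process number             */
    MPI_Comm_size(MPI_COMM_WORLD, &size);     /* total number of processes     */
    printf("Hello from process %d of %d\n", rank, size);
    MPI_Finalize( );                          /* shut down MPI                 */
    return 0;
}

Unlike OpenMP, each MPI process has its own memory space, so any data exchange must be written explicitly, e.g., with MPI_Send/MPI_Recv or collective operations.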
(From the Japanese Wikipedia) OpenMP is a standardized infrastructure for using parallel computing environments, employed mainly on shared-memory parallel computers. While MPI requires message exchanges to be written explicitly in the program, OpenMP performs parallelization through directives that are simply ignored in environments where OpenMP is not available; this has the advantage that almost the same source code can be used in both parallel and non-parallel environments. Compared with MPI, OpenMP lets different threads reference the same data at the same address, whereas MPI requires explicit message exchange; in an SMP environment, high efficiency can therefore be expected because large data transfers are avoided. However, the parallelization efficiency depends on the compiler, so the performance gain from tuning is often not as high as with MPI. Also, OpenMP tends to have lower memory-access locality than MPI, so for programs with frequent memory accesses MPI is often faster. Standardization currently covers FORTRAN and C/C++.
5.6.1 Let's try OpenMP —Brief introduction for beginners—

This is for (Windows + multicore) users.
1. Construct an OpenMP environment for free by using the Visual Studio Express Edition.
(a) Find the site for Visual Studio Express. A search with keywords (e.g., “visual studio express 2008”) will probably work. Download Setup.exe through the Web Setup.
I recommend installing Visual Studio 2008 or the latest Visual Studio 2013, instead of the 2010 or 2012 version. I do not know why, but the OpenMP environment seems not to be included in the 2010 Express edition.
The following is an example of the installation process; the details may depend on your system.
(b) Install the necessary components. You probably do not need the huge amount of documentation and samples.
(c) Wait several minutes until the installation is finished. Now you will have a command shell to compile C programs.
#include <omp.h>
#include <stdio.h>
int main()
{
#pragma omp parallel
printf("Hello from thread %d, nthreads %d\n",
omp_get_thread_num(), omp_get_num_threads());
return 0;
}
Figure 5-36: Sample program hello.cpp using OpenMP. This test code was taken from http://openmp.org/wp/openmp-compilers/
(d) Just compile and execute the test code. You have to compile this program (hello.cpp) with the /openmp option, as

>cd (the folder containing the program file)
>cl /openmp hello.cpp
...
>hello.exe
Hello from thread 0, nthreads 4
Hello from thread 1, nthreads 4
Hello from thread 2, nthreads 4
Hello from thread 3, nthreads 4
Congratulations! You certainly have an OpenMP environment. If you see nthreads 4, you must have a quad-core system.
2. A simple example of load-distribution parallelization:
Consider an infinite series (無限級数), for example

  I = Σ_{n=0}^{∞} (−1)^n/(2n + 1) = 1 − 1/3 + 1/5 − ··· = π/4    (5-54)
A simple code is given in Fig. 5-37.
A naïve parallelized code using two threads is shown in Fig. 5-38, and the results are shown in Fig. 5-39. The results (I = 3.141592...) are accurate enough, but unfortunately the computation time is more than doubled... This is probably due to the overhead of parallelization.
#include <stdio.h>
#include <time.h>
#define NMAX 200000000
int main( )
{
int i, sg;
double sum=0.0;
clock_t t1,t2;
sg=1;
t1=clock( );
for (i=0;i<NMAX;i++) {
sum += sg/(double)(2*i+1);
sg*=-1;
}
t2=clock( );
printf("I = %15.12f\n",4*sum);
printf("%10.5f sec on single node\n",(double)(t2-t1)/CLOCKS_PER_SEC);
return 0;
}
Figure 5-37: Original code series1.cpp for a series calculation. The header time.h is included for time
measurement.
#include <stdio.h>
#include <time.h>
#include <omp.h>
#define NMAX 200000000
int main( )
{
int i,j;
double sum1,sum2;
clock_t t1,t2;
t1=clock( );
sum1=sum2=0.0;
#pragma omp parallel sections
{
#pragma omp section
for (i=0;i<NMAX;i+=2) sum1 += 1/(double)(2*i+1);
#pragma omp section
for (j=1;j<NMAX;j+=2) sum2 -= 1/(double)(2*j+1);
}
t2=clock( );
printf("I = %15.12f\n",4*(sum1+sum2));
printf("%10.5f sec on single node\n",(double)(t2-t1)/CLOCKS_PER_SEC);
return 0;
}
Figure 5-38: A code of manual load division series2.cpp.
Figure 5-39: Example of executing the parallel code.
3. A slightly more elaborate version:
Instead of manual parallelization, we can utilize the
#pragma omp parallel for
directive,43) which automatically parallelizes the loop, depending on the environment. A sample code and its results are shown in Figs. 5-40 and 5-41. The computation time is significantly reduced!
43) A directive is an instruction (指示行) given to the compiler; in C, it starts with # at the beginning of the line. The #include and #define we have been using are in fact also directives.
#include <stdio.h>
#include <time.h>
#include <omp.h>
#define NMAX 200000000
int main( )
{
int i;
double sum=0.0;
clock_t t1,t2;
t1=clock( );
#pragma omp parallel for reduction(+:sum)
for (i=0;i<NMAX;i++)
{
int sg=1;
if (i%2==1) sg=-1;
sum += sg/(double)(2*i+1);
}
t2=clock( );
printf("I = %15.12f\n",4*sum);
printf("%10.5f sec on single node\n",(double)(t2-t1)/CLOCKS_PER_SEC);
return 0;
}
Figure 5-40: series3.cpp using the parallel for directive.
Figure 5-41: Results of series3.cpp
4. Of course, you can do much more with OpenMP. For more information, see references and various online manuals, for example 菅原清文, 「C/C++ プログラマーのための OpenMP」 (OpenMP for C/C++ Programmers), カットシステム, 2009.
5.7 Parallel Code of MD Simulation
Now let's turn our attention to our MD code. As the system size increases (e.g., with N ≥ 10,000), you may need to execute your MD simulation on a multi-core PC.
5.7.1 Amdahl's law
In most cases, a complete parallelization of simulation codes is neither possible nor necessary. As described in § 2, a typical MD code has the following structure.

  Initialize        : negligible time
  Force calculation : repeated many times, heavily time-consuming, ∝ N^α
  (r, p) update     : repeated many times, lightly time-consuming, ∝ N
  Statistics        : repeated many times
  Finalize          : negligible time
Thus we should focus first on the force part, and then on the move part. You should keep the so-called Amdahl's law (アムダールの法則)44) in mind.

Amdahl's law (from the Japanese Wikipedia)
When the fraction of execution time spent in the sequential part that cannot be parallelized is F, the parallelizable part is (1 − F), and the overall speedup obtained with N processors is

  1 / [ F + (1 − F)/N ]

In the limit of N going to infinity, the speedup approaches 1/F. In practice, once (1 − F)/N becomes sufficiently small compared with F, increasing N no longer improves the performance (the price/performance ratio gets worse). For example, if F is only 10%, the speedup is capped at 10 times that of a single processor, however large N is made. For this reason, parallel computing is effective only when the number of processors is small, or when the problem in the application domain has an extremely small F (a case called “embarrassingly parallel”). Many programming techniques of parallel computing are aimed at making F as small as possible. It is commonly said that desktop computing gains little from multiprocessing because of Amdahl's law, and the introduction of multi-core microprocessors was once questioned; Intel, however, has been positive about multi-core, expecting that desktop applications will become increasingly parallelized.

[Figure caption] When multiple processors are used to improve performance, the gain is strongly limited by the fraction of the program that cannot be parallelized. For example, if half (0.5) of the program cannot run in parallel, the theoretical speedup limit is 2 (the value of 1/(0.5 + (1 − 0.5)/N) as N is made very large).
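As a quick numerical illustration (a small sketch added here, not from the original notes), the following program evaluates this formula for a sequential fraction of F = 0.1:

#include <stdio.h>

/* Amdahl's law: overall speedup with N processors when a fraction F of the
   execution time is strictly sequential. */
static double amdahl_speedup(double F, int N)
{
    return 1.0 / (F + (1.0 - F) / N);
}

int main( )
{
    int    N;
    double F = 0.1;                          /* 10% sequential part */
    for (N = 1; N <= 1024; N *= 4)
        printf("N = %4d   speedup = %6.2f\n", N, amdahl_speedup(F, N));
    /* the speedup saturates below 1/F = 10, however large N becomes */
    return 0;
}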
5.7.2 Particle parallel vs. Region parallel
As seen in the sample codes of the series calculation (series1.cpp and series2.cpp), parallelization at the data level is not very difficult. This corresponds to a particle-parallel scheme, because the loop index in the force part is, in general, that of the particles.
Another scheme is possible when we consider that the particles are distributed in a (sometimes huge) physical space. As shown in Fig. 5-42, particles in one “region” have essentially no interaction with those in a distant region, so the accumulation of forces can be executed independently. This is called a region-parallel scheme; it is very natural, applicable both on distributed-memory systems and on shared-memory ones, and generally highly efficient for large systems. However, the region boundaries should be properly taken care of.
Figure 5-42: Idea of the region-parallel scheme.
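On a shared-memory machine, one rough way to realize the region-parallel idea is to combine it with the cell division of § 5.2.3 and let each OpenMP thread take care of a group of sub-cells. The sketch below is our own illustration, not the author's code: it avoids write conflicts between threads by using the full 26-cell neighbor stencil and updating only the forces of the particles owned by the current cell, at the price of evaluating every pair twice; periodic boundaries and the force expression itself are omitted.

#include <omp.h>

#define NATOM     8000
#define NUMCELL   1000
#define MAXINCELL 64

double posx[NATOM], posy[NATOM], posz[NATOM];
double frcx[NATOM], frcy[NATOM], frcz[NATOM];
int    ntable[NUMCELL][MAXINCELL], ntmax[NUMCELL];
int    neighborcell26[NUMCELL][26];   /* full (26-member) neighbor list of each sub-cell */
double RCUT = 4.0;

void force_region_parallel( )
{
    int icell;
#pragma omp parallel for schedule(static)
    for (icell = 0; icell < NUMCELL; icell++) {        /* each thread owns whole sub-cells */
        int it, jt, jc, i, j, jcell;
        double dx, dy, dz, r2, fx, fy, fz;
        for (it = 0; it < ntmax[icell]; it++) {
            i = ntable[icell][it];
            fx = fy = fz = 0.0;
            /* loop over the same sub-cell (jc = -1) and its 26 neighbors */
            for (jc = -1; jc < 26; jc++) {
                jcell = (jc < 0) ? icell : neighborcell26[icell][jc];
                for (jt = 0; jt < ntmax[jcell]; jt++) {
                    j = ntable[jcell][jt];
                    if (j == i) continue;
                    dx = posx[i] - posx[j];
                    dy = posy[i] - posy[j];
                    dz = posz[i] - posz[j];
                    r2 = dx*dx + dy*dy + dz*dz;
                    if (r2 < RCUT*RCUT) {
                        /* accumulate the force on i into fx, fy, fz here */
                    }
                }
            }
            frcx[i] = fx; frcy[i] = fy; frcz[i] = fz;   /* only particles of this cell are written */
        }
    }
}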
44) (From Wikipedia) Also known as Amdahl's argument, it is named after the computer architect Gene Amdahl (1922–, U.S.A.) and is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors.
5.7.3 Sample code
The following is a sample code with the “particle parallel” scheme using OpenMP, where I have parallelized the inner loop of the double for loop with
#pragma omp parallel for
directives (指示行).45)46) For details of the #pragma omp directives, see references. Unfortunately, this code is not very efficient, probably because of the small grain size (粒度) of the inner loop. You can find a somewhat more elaborate version on my web page.
//
// A Simple MD Code for Lennard-Jones System
//    mass=1; epsilon=1; sigma=1;
// Version 05: Potential Truncation at RCUT
//    Parallel with OpenMP
//
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#include <time.h>
#define NUM_LATTICE 20
#define NUM_ATOM    (NUM_LATTICE*NUM_LATTICE*NUM_LATTICE)
#define TOTAL_STEP  100
#define SAVE_STEP   10
double DEL_T   = 0.001;
double DENSITY = 1.0;
double CELL_X, CELL_Y, CELL_Z;
double RCUT    = 4.0;
// Function Prototypes
void initial( );
void force( );
void move( );
void statistics(int);
void initplot( );
void finalplot( );
void scaling( );
double posx[NUM_ATOM], posy[NUM_ATOM], posz[NUM_ATOM];
double momx[NUM_ATOM], momy[NUM_ATOM], momz[NUM_ATOM];
double frcx[NUM_ATOM], frcy[NUM_ATOM], frcz[NUM_ATOM];
double eng_kin, eng_pot;
double t_target, tsum;
int ntcount;
int num_threads;
FILE *fgnuplot, *fsave;
int main( )
{
int step;
clock_t t1,t2;
printf("Input Target Temperature :");
scanf("%lf",&t_target);
#pragma omp parallel
{ num_threads=omp_get_num_threads();}
printf("This code is working on %2d nodes\n",num_threads);
initial( );
initplot( );
t1=clock( );
for (step=0; step<=TOTAL_STEP; step++) {
force( );
move( );
if (step%SAVE_STEP==0) {
statistics(step);
}
}
t2=clock( );
printf("%10.5f sec on %3d nodes\n",(double)(t2-t1)/CLOCKS_PER_SEC,num_threads);
finalplot( );
}
45) private lists the variables that are private, i.e., independent among the threads.
46) reduction lists the variables to be summed in the loop. The summation is taken independently in each thread, and the total sum is formed after the loop.
//------------------------------------------------
//   Initialize Gnuplot Command File
//
void initplot( )
{
  fgnuplot=fopen("lj.plt","w");
  fprintf(fgnuplot,"set data st point\n");
  fprintf(fgnuplot,"set size square\n");
  fprintf(fgnuplot,"set xran [0:%lf]\n",CELL_X);
  fprintf(fgnuplot,"set yran [0:%lf]\n",CELL_Y);
}
//------------------------------------------------
//   Finalize Gnuplot Command File
//
void finalplot( )
{
fprintf(fgnuplot,"pause -1\n");
fclose(fgnuplot);
}
//------------------------------------------------
//   Make Simple Cubic Lattice with Random Momenta
//
void initial( )
{
int i=0;
int ix,iy,iz;
double dx,dy,dz;
CELL_X=pow(DENSITY, -1.0/3.0)*NUM_LATTICE;
CELL_Y=CELL_X;
CELL_Z=CELL_X;
dx=CELL_X/NUM_LATTICE;
dy=CELL_Y/NUM_LATTICE;
dz=CELL_Z/NUM_LATTICE;
tsum=0.0;
for (ix=0;ix<NUM_LATTICE;ix++) {
for (iy=0;iy<NUM_LATTICE;iy++) {
for (iz=0;iz<NUM_LATTICE;iz++) {
posx[i] = dx*ix;
posy[i] = dy*iy;
posz[i] = dz*iz;
momx[i] = (rand( )/(double)RAND_MAX -0.5);
momy[i] = (rand( )/(double)RAND_MAX -0.5);
momz[i] = (rand( )/(double)RAND_MAX -0.5);
tsum += (momx[i]*momx[i]+momy[i]*momy[i]+momz[i]*momz[i]);
i++;
} } }
tsum /= 2;
scaling( );
}
//------------------------------------------------
//   Calculate Forces: minimal image convention
//
void force( )
{
int i, j;
double dx,dy,dz,r2,ri06,ri12;
double eng, frc;
double cxh=CELL_X/2;
double cyh=CELL_Y/2;
double czh=CELL_Z/2;
double fix,fiy,fiz;
eng_pot=0.0;
for (i=0; i<NUM_ATOM; i++) {
frcx[i]=0.0;
frcy[i]=0.0;
frcz[i]=0.0;
}
for (i=0; i<NUM_ATOM; i++) {
fix=fiy=fiz=0.0;
#pragma omp parallel for private(dx,dy,dz,r2,ri06,ri12,eng,frc) reduction(+:eng_pot,fix,fiy,fiz)
for (j=i+1; j<NUM_ATOM; j++) {
dx=posx[i]-posx[j];
if (dx<-cxh) dx+=CELL_X;
else if (dx>cxh) dx-=CELL_X;
dy=posy[i]-posy[j];
if (dy<-cyh) dy+=CELL_Y;
else if (dy>cyh) dy-=CELL_Y;
dz=posz[i]-posz[j];
if (dz<-czh) dz+=CELL_Z;
else if (dz>czh) dz-=CELL_Z;
r2=dx*dx+dy*dy+dz*dz;
if (r2<RCUT*RCUT) {
ri06=1.0/(r2*r2*r2);
ri12=ri06*ri06;
eng=4*(ri12-ri06);
frc=4*(12*ri12-6*ri06)/r2;
eng_pot += eng;
fix     += frc*dx;
fiy     += frc*dy;
fiz     += frc*dz;
frcx[j] -= frc*dx;
frcy[j] -= frc*dy;
frcz[j] -= frc*dz;
}
}
frcx[i]+=fix;
frcy[i]+=fiy;
frcz[i]+=fiz;
}
}
//------------------------------------------------
//   Update Particle Position & Momentum: Periodic Boundaries
//
void move( )
{
int i;
eng_kin=0.0;
#pragma omp parallel for reduction(+:eng_kin)
for (i=0; i<NUM_ATOM; i++) {
momx[i] += frcx[i]*DEL_T;
momy[i] += frcy[i]*DEL_T;
momz[i] += frcz[i]*DEL_T;
eng_kin += (momx[i]*momx[i]+momy[i]*momy[i]+momz[i]*momz[i]);
posx[i] += momx[i]*DEL_T;
if (posx[i]<0.0)         posx[i]+=CELL_X;
else if (posx[i]>CELL_X) posx[i]-=CELL_X;
posy[i] += momy[i]*DEL_T;
if (posy[i]<0.0)         posy[i]+=CELL_Y;
else if (posy[i]>CELL_Y) posy[i]-=CELL_Y;
posz[i] += momz[i]*DEL_T;
if (posz[i]<0.0)         posz[i]+=CELL_Z;
else if (posz[i]>CELL_Z) posz[i]-=CELL_Z;
}
eng_kin /= 2;
ntcount++;
tsum += (2.0/3.0)*eng_kin;
}
//------------------------------------------------
//   Output Statistical Data & Configuration
//
void statistics(int step)
{
int i;
char fname[100];
tsum /= ntcount;
scaling( );
printf("%8d %10.5lf %10.5lf %10.5lf\n",
step, eng_kin/NUM_ATOM, eng_pot/NUM_ATOM, (eng_kin+eng_pot)/NUM_ATOM);
sprintf(fname,"lj%8.8d.dat",step);
fsave=fopen(fname,"w");
fprintf(fsave,"%f %f %f",CELL_X,CELL_Y,CELL_Z);
for (i=0;i<NUM_ATOM;i++) {
fprintf(fsave,"%8.3lf %8.3lf %8.3lf ",posx[i],posy[i],posz[i]);
fprintf(fsave,"%8.3lf %8.3lf %8.3lf\n",momx[i],momy[i],momz[i]);
}
fclose(fsave);
fprintf(fgnuplot,"plot \"%s\" using 1:2 t \"%5.5d\" w po\n",fname,step);
if (step==0) fprintf(fgnuplot,"pause -1\n");
}
//------------------------------------------------
//   Velocity Scaling
//
void scaling( )
{
int i;
double factor;
factor=sqrt(t_target*NUM_ATOM/tsum);
for (i=0;i<NUM_ATOM;i++) {
momx[i] *= factor;
momy[i] *= factor;
momz[i] *= factor;
}
tsum=0.0;
ntcount=0;
}
List: A sample code of parallel MD, lj-v05-para.cpp.
Today's Homework No. 4 (May 12)

This homework is optional, because you may not have a proper environment for using OpenMP on your PC. However, I strongly recommend experiencing parallel calculation to those who are planning to do large-scale numerical simulations in the future. Send your report to [email protected] by May 31.
• Construct an OpenMP environment on your PC and confirm that your PC has a multi-core CPU by running the test program in Fig. 5-36. You can find a document containing the link to the download site of the Visual Studio 2013 Express edition on my homepage http://www.mitsuhiromatsumoto.mech.kyoto-u.ac.jp/
• Download a sample code (lj-v05-para.cpp or lj-v05-para2.cpp) of the parallel MD simulation and execute it for various N at the same density. If you want, you can check the multi-core behavior using the performance monitor. Of course, you have to stop other heavy tasks when you run your parallel code on your PC.
• Compare the calculation time between the single mode and the multi-core mode.
[Screenshots from the performance monitor: % of processor time on a single node and on dual nodes.]
Prof. Nishikawa will give the class next week, on May 19.