Center for Computational Sciences, University of Tsukuba: CCS HPC Summer Seminar, "Optimization I"

Daisuke Takahashi
[email protected]
Graduate School of Systems and Information Engineering, University of Tsukuba
Center for Computational Sciences

Lecture outline
• Techniques for optimizing programs on a single compute node of a parallel computer system
  – Register blocking
  – Cache blocking
  – Using streaming SIMD instructions

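Performance tuning
• Everyone recognizes the importance of performance in software applications.
• Yet performance tuning tends to be pushed to the end of the software development cycle, and in some cases it is never considered at all.
• Factors behind this situation include:
  – The belief that code-generation tools and compilers alone can optimize an application.
  – The overly optimistic expectation that simply using the latest processor will deliver the best performance at run time.
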
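Why performance tuning matters
• Consider, however, a computation that takes several months to run, where optimization can shorten the execution time by something on the order of months.
• For a program used by many people, such as a numerical library, tuning is well worth the effort.
• If tuning improves performance by 30%, the effect is the same as using a machine that is 30% faster.
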
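Optimization
• Optimization can target several things:
  – Reducing code size
  – Reducing data size
  – Reducing execution time
• In this lecture, "optimization" means rewriting a program in order to reduce its execution time.
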
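Benefits of optimization
• Reducing execution time through optimization brings:
  – More effective use of the computer
  – Lower power consumption (or usage charges)
  – More computation in the same amount of time
• Counting programming time plus execution time, the longer a program runs, the more it benefits from optimization.
  – If optimization improves performance by 30%, the effect is the same as using a machine that is 30% faster.
• For a program that runs only once and finishes quickly, optimization is rarely worthwhile.
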
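Before optimizing
• Is optimization necessary in the first place?
• Is the algorithm currently in use the best one?
• Optimizing an inefficient algorithm is pointless.
  – An optimized bubble sort will still not be faster than quicksort.
• The best algorithm depends heavily on:
  – The nature of the problem to be solved
  – The architecture, memory capacity, and other characteristics of the target computer
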
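Optimization policy
• When a fast vendor-supplied library is available, use it wherever possible.
  – BLAS, LAPACK, etc.
• Recent compilers have become very good at optimization.
• Do not apply by hand optimizations that the compiler can do itself.
  – It only costs effort.
  – The program becomes more complex, leaving room for bugs to creep in.
  – Do not overestimate the compiler's optimization ability, either.
• People should concentrate on improving the algorithm.
• Do not use assembly language unless it is unavoidable.
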
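Steps in optimization
• First, find out how much computational performance your program actually achieves.
• FLOPS (Floating-point Operations Per Second) is a metric of computational performance.
  – A unit expressing the number of floating-point operations that can be executed in one second
  – MFLOPS (10^6), GFLOPS (10^9), TFLOPS (10^12)
• From the execution time of the whole program (or a part of it) and its operation count, compute the FLOPS value and compare it with the processor's theoretical peak performance.
  – Pentium 4: peak FLOPS is 2 times the clock frequency
  – Intel Core2, Core i7, and AMD Quad-Core Opteron: peak FLOPS is 4 times the clock frequency (per core)
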
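The comparison against theoretical peak described above can be sketched in C as follows. The function names are hypothetical, and the flops-per-cycle figure has to be supplied by the caller for the processor in question (for example 4 for Core2, as stated on the slide).

```c
/* Achieved floating-point rate in FLOPS: operation count / elapsed seconds. */
double achieved_flops(double flop_count, double seconds)
{
    return flop_count / seconds;
}

/* Theoretical peak in FLOPS: clock rate x flops per cycle x number of cores.
 * flops_per_cycle is an input assumption, not something this code detects. */
double peak_flops(double clock_hz, int flops_per_cycle, int cores)
{
    return clock_hz * flops_per_cycle * cores;
}

/* Fraction of the theoretical peak that was actually achieved. */
double efficiency(double achieved, double peak)
{
    return achieved / peak;
}
```

For a 3 GHz dual-core processor with 4 flops per cycle, peak_flops(3.0e9, 4, 2) gives 24 GFLOPS, so a kernel measured at 6 GFLOPS would be running at 25% of peak.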
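Time measurement
• Times that can be measured include:
  – Elapsed time
  – CPU time
• If the target program runs only briefly, the timer resolution may be insufficient.
  – Wrap it in an outer loop, run it several times, and measure the total.
• In that case, beware that compiler optimization may eliminate the repetition, so that the loop does not actually run.
  – Insert a dummy routine, or make the measured code a subroutine and compile it separately.
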
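A minimal sketch of the repeated-measurement idea, assuming a POSIX system where clock_gettime is available. The kernel daxpy and the function time_daxpy are illustrative names only. Instead of the separate compilation suggested above, this sketch reads a result into a volatile variable each repetition so the compiler cannot discard the work.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Illustrative kernel under test: y = y + a*x. */
static void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Average elapsed time in seconds of one daxpy call over `reps` repetitions. */
double time_daxpy(int n, double a, const double *x, double *y, int reps)
{
    struct timespec t0, t1;
    volatile double sink = 0.0;   /* consuming a result keeps the loop alive */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++) {
        daxpy(n, a, x, y);
        sink += y[r % n];
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;

    return ((double)(t1.tv_sec - t0.tv_sec)
            + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9) / reps;
}
```

This measures elapsed (wall-clock) time; CPU time would need a different clock, such as clock() from the same header.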
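Hot spots
• The part of a program that accounts for most of the computation time is called a "hot spot".
• First, find out where the hot spots are.
• A profiler is a convenient tool for this.
  – On Linux, the gprof command is available.
• Adding the "-pg" compiler option, as in "gcc -pg foo.c", generates special code that writes the profile information used by gprof.
  – Run a.out, then run "gprof a.out" to identify the hot spots.
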
gprof output
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 48.90      2.90     2.90        2     1.45     2.83  zfft1d0_
 32.38      4.82     1.92    49152     0.00     0.00  fft8b_
 14.17      5.66     0.84    16384     0.00     0.00  fft8a_
  4.55      5.93     0.27        1     0.27     5.93  MAIN__
  0.00      5.93     0.00    16384     0.00     0.00  fft235_
  0.00      5.93     0.00        4     0.00     0.00  factor_
  0.00      5.93     0.00        3     0.00     1.89  zfft1d_
  0.00      5.93     0.00        2     0.00     0.00  settbl_
  0.00      5.93     0.00        1     0.00     0.00  settbls_

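What the gprof results show
• The hot spots are the three routines
  – zfft1d0_
  – fft8b_
  – fft8a_
  which together account for 95% of the total execution time.
• It is enough to focus optimization on these hot spots alone.
• When writing a program, take care that the hot spots are concentrated in a few places.
• If there are many hot spots, improving the code takes a lot of effort.
  – In some cases it is better to rewrite the code from scratch.
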
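Compile options
• Performance varies greatly depending on how the compile options are specified.
• Consult the compiler manual and try various compile options.
  – "-fast", "-O3", "-O2", etc.
  – For the Intel Compiler, "-xS" (for Intel Core2)
• Raising the optimization level does not necessarily produce faster code.
  – The compiler may apply optimizations that turn out to be counterproductive.
  – Note that the computed results may also fail to match in some cases.
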
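Compiler directives
• Compiler directives (directive lines) convey the programmer's intent to the compiler and assist optimization.
  – Unlike compile options, they allow optimization to be controlled loop by loop.
• Examples of directives
  – Telling the compiler, when vectorizing, that a loop has no dependences
  – Controlling whether a loop is vectorized
• They are usually written with "#pragma" in C and with "!dir$" or "cpgi$l" in Fortran.
  (Note that the syntax can differ from compiler to compiler.)
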
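As one concrete illustration of a per-loop directive in C, the sketch below uses GCC's `#pragma GCC ivdep`, which asserts that the following loop has no loop-carried dependences. This is a swapped-in GCC example, not the Intel directive used elsewhere in these slides, and the function name is hypothetical; other compilers spell the directive differently or ignore it.

```c
/* a[i] += b[i] * c for i = 0 .. n-1.
 * The pragma is GCC-specific: it promises the compiler that iterations
 * are independent, so a and b must not overlap when this is called. */
void scaled_add(int n, double *a, const double *b, double c)
{
#pragma GCC ivdep
    for (int i = 0; i < n; i++)
        a[i] += b[i] * c;
}
```

The directive changes only what the compiler is allowed to assume; the computed result is the same as without it, provided the promise of independence actually holds.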
ZAXPY written in Fortran
      subroutine zaxpy(n,a,x,y)
      complex*16 a,x(*),y(*)
!dir$ vector aligned
      do i=1,n
        y(i)=y(i)+a*x(i)
      end do
      return
      end

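Points to note when writing programs
• Follow the C and Fortran language rules properly.
  – Some compilers only issue a warning, but in most cases such code becomes a source of bugs.
• Avoid compiler-dependent extensions wherever possible, except when unavoidable (directives, for example).
  – Example: automatic arrays in g77
    • A declaration such as real*8 a(n) where a(n) is not a dummy argument and n is a variable
  – The program is no longer portable.
  – They can cause unexpected errors.
• Avoid functions and features that are (or seem to be) rarely used.
  – Compiler bugs in them may not have been fully weeded out.
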
Memory hierarchy (1/4)
[Figure: memory hierarchy diagram. Source: Wikipedia]

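Memory hierarchy (2/4)
• Cost balance of the storage devices that hold data
  – The trade-off "small capacity, fast <-> large capacity, slow" holds.
• "Small capacity, fast" storage
  – Registers
• "Large capacity, slow" storage
  – Hard disks and magnetic tape
• "Large capacity, fast" storage has poor cost performance and is difficult to realize.
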
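Memory hierarchy (3/4)
• The memory hierarchy is designed on the premise that access patterns to storage exhibit locality.
• There are two kinds of locality:
  – Temporal locality
    • An access to a given address tends to recur within a relatively short time.
  – Spatial locality
    • Data accessed within a given period tend to be distributed over relatively nearby addresses.
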
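Spatial locality can be made concrete with two traversals of the same C array, sketched below. DIM and the function names are illustrative. Both functions return the same sum; they differ only in the order in which memory is touched.

```c
#define DIM 64

/* Row-major traversal: consecutive iterations touch adjacent addresses,
 * so every element of a fetched cache line gets used (good spatial locality). */
double sum_row_major(double m[DIM][DIM])
{
    double s = 0.0;
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            s += m[i][j];
    return s;
}

/* Column-major traversal of the same C array: consecutive iterations are
 * DIM doubles apart, so spatial locality is poor for large DIM. */
double sum_col_major(double m[DIM][DIM])
{
    double s = 0.0;
    for (int j = 0; j < DIM; j++)
        for (int i = 0; i < DIM; i++)
            s += m[i][j];
    return s;
}
```

With a matrix this small both versions fit in cache; the difference in run time only becomes visible once the array is much larger than the cache.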
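Memory hierarchy (4/4)
• These tendencies usually hold for non-numerical workloads such as business data processing, but they are not typical of numerical programs.
• Large-scale scientific and engineering computations in particular often have no temporal locality in their data references.
• This is a major reason why vector supercomputers were advantageous for scientific and engineering computation.
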
BLAS performance (Woodcrest 2.4 GHz, 4 MB L2 cache, Intel MKL 9.1)
[Figures: BLAS performance plots, two slides]

BLAS operation counts
BLAS                            Loads + stores    Floating-point operations   Ratio (n=m=k)
Level 1 DAXPY  y = y + αx       3n                2n                          3:2
Level 2 DGEMV  y = βy + αAx     mn + n + 2m       2mn                         1:2
Level 3 DGEMM  C = βC + αAB     2mn + mk + kn     2mnk                        2:n

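The ratios in the table follow directly from the two count columns. The short C sketch below (function names are hypothetical) evaluates loads plus stores per floating-point operation for each level, which shows how only Level 3 reuses data more as the problem grows.

```c
/* Loads + stores per floating-point operation, from the counts in the table. */

/* Level 1 DAXPY: 3n memory operations / 2n flops = 1.5, independent of n. */
double daxpy_mem_per_flop(double n)
{
    return (3.0 * n) / (2.0 * n);
}

/* Level 2 DGEMV: (mn + n + 2m) / 2mn, which approaches 1/2 for large m = n. */
double dgemv_mem_per_flop(double m, double n)
{
    return (m * n + n + 2.0 * m) / (2.0 * m * n);
}

/* Level 3 DGEMM: (2mn + mk + kn) / 2mnk, which equals 2/n when n = m = k. */
double dgemm_mem_per_flop(double m, double n, double k)
{
    return (2.0 * m * n + m * k + k * n) / (2.0 * m * n * k);
}
```

For n = m = k = 100, DGEMM needs about 0.02 memory operations per flop, against 1.5 for DAXPY.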
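The Byte/Flop concept (1/2)
• The amount of memory access needed to perform one floating-point operation can be expressed as Byte/Flop.
      subroutine daxpy(n, a, x, y)
      real*8 a, x(*), y(*)
      do i = 1, n
        y(i) = y(i) + a * x(i)
      end do
      end
• In daxpy, each iteration performs 2 double-precision floating-point operations and requires 3 loads/stores of double-precision real data (24 bytes in total).
  – This gives 24 Byte / 2 Flop = 12 Byte/Flop.
• The smaller the Byte/Flop value, the better.
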
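The Byte/Flop concept (2/2)
• For a dual-core Intel Core2 processor (3 GHz):
  – Theoretical peak performance is 12 Gflops x 2 cores = 24 Gflops
  – Memory bandwidth is at most 21 GB/s (with 4 channels)
  – The Byte/Flop value is 21/24 = 0.875
• In daxpy, once the working set exceeds the cache capacity, memory bandwidth (21 GB/s) becomes the limiting factor, so performance cannot exceed 21/12 = 1.75 Gflops.
  – Only about 7% of theoretical peak!
• Memory bandwidth thus determines the sustained performance.
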
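The bandwidth bound worked out above is a single division, sketched here in C with hypothetical function names so that the same estimate can be repeated for other kernels and machines.

```c
/* Memory traffic per floating-point operation for a kernel (Byte/Flop). */
double bytes_per_flop(double bytes_moved, double flops)
{
    return bytes_moved / flops;
}

/* Upper bound on performance (Gflops) when memory bandwidth is the limit:
 * bandwidth in GB/s divided by the kernel's Byte/Flop requirement. */
double bandwidth_bound_gflops(double bandwidth_gb_s, double kernel_bytes_per_flop)
{
    return bandwidth_gb_s / kernel_bytes_per_flop;
}
```

With the slide's numbers, bytes_per_flop(24, 2) is 12, and bandwidth_bound_gflops(21, 12) is 1.75, which is roughly 7% of the 24 Gflops peak.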
Loop unrolling (1/2)
• Loop unrolling expands the loop body in order to
  – reduce loop overhead
  – perform register blocking
• Unrolling too far causes register shortages and instruction cache misses, so care is needed.
/* Original loop */
double A[N], B[N], C;
for (i = 0; i < N; i++) {
    A[i] += B[i] * C;
}
/* Unrolled by 4 (assumes N is a multiple of 4) */
double A[N], B[N], C;
for (i = 0; i < N; i += 4) {
    A[i] += B[i] * C;
    A[i+1] += B[i+1] * C;
    A[i+2] += B[i+2] * C;
    A[i+3] += B[i+3] * C;
}

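The unrolled loop on the slide is only correct when N is a multiple of 4. A common way to lift that restriction, sketched here with a hypothetical function name, is to run the unrolled loop over the largest multiple of 4 and finish with a short cleanup loop.

```c
/* a[i] += b[i] * c for i = 0 .. n-1, unrolled by 4 with a cleanup loop. */
void scaled_add_unrolled(int n, double *a, const double *b, double c)
{
    int i;
    int n4 = n - n % 4;            /* largest multiple of 4 not exceeding n */

    for (i = 0; i < n4; i += 4) {  /* main unrolled loop */
        a[i]     += b[i]     * c;
        a[i + 1] += b[i + 1] * c;
        a[i + 2] += b[i + 2] * c;
        a[i + 3] += b[i + 3] * c;
    }
    for (; i < n; i++)             /* remaining 0 to 3 elements */
        a[i] += b[i] * c;
}
```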
Loop unrolling (2/2)
/* Matrix multiplication example */
double A[N][N], B[N][N],
       C[N][N], s;
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        s = 0.0;
        for (k = 0; k < N; k++) {
            s += A[i][k] * B[k][j];
        }
        C[i][j] = s;
    }
}
/* Optimized matrix multiplication example:
   the i loop is unrolled by 2 so that each B[k][j] is reused
   for two rows of A (assumes N is even) */
double A[N][N], B[N][N],
       C[N][N], s0, s1;
for (i = 0; i < N; i += 2) {
    for (j = 0; j < N; j++) {
        s0 = 0.0; s1 = 0.0;
        for (k = 0; k < N; k++) {
            s0 += A[i][k] * B[k][j];
            s1 += A[i+1][k] * B[k][j];
        }
        C[i][j] = s0;
        C[i+1][j] = s1;
    }
}

Loop interchange
• Loop interchange is a technique mainly for reducing the impact of memory references with a large stride.
• The compiler sometimes decides to interchange the loops by itself.
/* Before loop interchange */
double A[N][N], B[N][N], C;
for (j = 0; j < N; j++) {
    for (k = 0; k < N; k++) {
        A[k][j] += B[k][j] * C;
    }
}
/* After loop interchange */
double A[N][N], B[N][N], C;
for (k = 0; k < N; k++) {
    for (j = 0; j < N; j++) {
        A[k][j] += B[k][j] * C;
    }
}

Padding
• Effective when several arrays are mapped to the same cache locations and thrashing occurs.
  – Especially for arrays whose size is a power of 2
• Try slightly changing the declared size of a multidimensional array.
• Some compilers do this for you when given the appropriate compile option.
/* Before padding */
double A[N][N], B[N][N];
for (k = 0; k < N; k++) {
    for (j = 0; j < N; j++) {
        A[j][k] = B[k][j];
    }
}
/* After padding */
double A[N][N+1], B[N][N+1];
for (k = 0; k < N; k++) {
    for (j = 0; j < N; j++) {
        A[j][k] = B[k][j];
    }
}

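Why a power-of-2 row length hurts can be seen by computing which cache set each element of one column falls into. The sketch below assumes a simple set-indexed cache, where set = (address / line size) mod number of sets. The function names and the cache parameters used in the note are illustrative assumptions, not figures from the slides.

```c
#include <stdlib.h>

/* Cache set index of a byte address in a set-indexed cache. */
unsigned long cache_set(unsigned long byte_addr,
                        unsigned long line_bytes, unsigned long num_sets)
{
    return (byte_addr / line_bytes) % num_sets;
}

/* Number of distinct cache sets touched when walking down one column of a
 * row-major matrix of doubles with `rows` rows and leading dimension `ld`.
 * Returns -1 if the bookkeeping array cannot be allocated. */
int distinct_sets_in_column(int rows, int ld,
                            unsigned long line_bytes, unsigned long num_sets)
{
    unsigned char *seen = calloc(num_sets, 1);
    int count = 0;

    if (seen == NULL)
        return -1;
    for (int i = 0; i < rows; i++) {
        unsigned long addr = (unsigned long)i * (unsigned long)ld * sizeof(double);
        unsigned long s = cache_set(addr, line_bytes, num_sets);
        if (!seen[s]) {
            seen[s] = 1;
            count++;
        }
    }
    free(seen);
    return count;
}
```

Assuming 64-byte lines and 64 sets, a column of 64 elements with leading dimension 1024 lands entirely in one set, while padding the leading dimension to 1025 spreads it over 8 sets and padding to 1032 (one full cache line) spreads it over all 64.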
Blocking (1/2)
• An effective method for optimizing memory references.
• Reduces cache misses as far as possible.
/* Without blocking */
double A[N][N], B[N][N], C;
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        A[i][j] += B[j][i] * C;
    }
}
/* With 4 x 4 blocking (assumes N is a multiple of 4) */
double A[N][N], B[N][N], C;
for (i = 0; i < N; i += 4) {
    for (j = 0; j < N; j += 4) {
        for (ii = i; ii < i + 4; ii++) {
            for (jj = j; jj < j + 4; jj++) {
                A[ii][jj] += B[jj][ii] * C;
            }
        }
    }
}

Blocking (2/2)
[Figure: memory access pattern without blocking]
[Figure: memory access pattern with blocking]

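The blocked loop on the previous slide needs N to be a multiple of the block size. A sketch that handles any n by clamping the block edges is shown below. The function name, the flattened row-major layout, and the block size BS are assumptions for illustration; in practice BS would be tuned to the cache.

```c
#define BS 4   /* block size, illustrative value only */

/* a[i][j] += b[j][i] * c for row-major n x n matrices stored as flat arrays,
 * processed in BS x BS blocks so that each block of b is reused while cached. */
void add_scaled_transpose_blocked(int n, double *a, const double *b, double c)
{
    for (int i = 0; i < n; i += BS) {
        for (int j = 0; j < n; j += BS) {
            int imax = (i + BS < n) ? i + BS : n;   /* clamp at the matrix edge */
            int jmax = (j + BS < n) ? j + BS : n;
            for (int ii = i; ii < imax; ii++)
                for (int jj = j; jj < jmax; jj++)
                    a[ii * n + jj] += b[jj * n + ii] * c;
        }
    }
}
```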
Using streaming SIMD instructions
• To process floating-point operations faster, many recent processors provide what are called streaming SIMD instructions.
  – SSE/SSE2/SSE3/SSSE3/SSE4 on Intel Pentium 4/Xeon
  – 3DNow! on AMD Athlon
  – AltiVec on Motorola PowerPC
• On the latest Intel Core2, using SSE3 instructions can raise floating-point performance by a factor of 4.

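Intel SSE3 instructions
• SSE2 is a set of arithmetic instructions introduced with the Intel Pentium 4/Xeon to replace the x87 instructions; SSE3 adds 13 new instructions to SSE2.
  – SIMD processing can be performed on 128-bit data.
  – Intel Pentium 4 and Xeon processors have eight 128-bit XMM registers, XMM0 to XMM7.
  – AMD Opteron processors and Xeon EM64T processors have sixteen, XMM0 to XMM15.
• With the SSE3 vector instructions, 64-bit double-precision floating-point arithmetic can be performed as vector operations of length 2 (addition, subtraction, multiplication, division, square root, logical operations).
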
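How to use SSE3 instructions
• SSE3 instructions can be used in the following ways.
(1) Let the compiler vectorize the code
(2) Use SSE3 intrinsics
(3) Use inline assembly
(4) Write a ".s" file directly in assembly language
• Going from (1) to (4), the coding becomes more complex but more advantageous in terms of performance.
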
Double-precision complex multiply-add (a + b * c) written with SSE3 intrinsics
#include <pmmintrin.h>   /* header file for using SSE3 instructions */
static __inline __m128d ZMULADD(__m128d a, __m128d b, __m128d c)
{
    __m128d br, bi;                /* variables of the 128-bit data type */
    br = _mm_movedup_pd(b);        /* br = [b.r b.r]  take only the real part */
    br = _mm_mul_pd(br, c);        /* br = [b.r*c.r b.r*c.i] */
    a = _mm_add_pd(a, br);         /* a = [a.r+b.r*c.r a.i+b.r*c.i] */
    bi = _mm_unpackhi_pd(b, b);    /* bi = [b.i b.i]  take only the imaginary part */
    c = _mm_shuffle_pd(c, c, 1);   /* c = [c.i c.r]  swap real and imaginary parts */
    bi = _mm_mul_pd(bi, c);        /* bi = [b.i*c.i b.i*c.r] */
    return _mm_addsub_pd(a, bi);   /* [a.r+b.r*c.r-b.i*c.i a.i+b.r*c.i+b.i*c.r] */
}

ZAXPY written in C
typedef struct { double r, i; } doublecomplex;
void zaxpy(int n, doublecomplex a, doublecomplex *x, doublecomplex *y)
{
    int i;
    if (a.r == 0.0 && a.i == 0.0) return;
#pragma unroll(8)
#pragma vector aligned
    for (i = 0; i < n; i++) {
        y[i].r += a.r * x[i].r - a.i * x[i].i;
        y[i].i += a.r * x[i].i + a.i * x[i].r;
    }
}

ZAXPY using SSE3 intrinsics
#include <pmmintrin.h>
typedef struct { double r, i; } doublecomplex;
__m128d ZMULADD(__m128d a, __m128d b, __m128d c);
void zaxpy(int n, doublecomplex a, doublecomplex *x, doublecomplex *y)
{
    int i;
    __m128d a0;
    if (a.r == 0.0 && a.i == 0.0) return;
    a0 = _mm_loadu_pd((double *)&a);
    /* _mm_load_pd and _mm_store_pd require x and y to be 16-byte aligned */
#pragma unroll(8)
    for (i = 0; i < n; i++)
        _mm_store_pd((double *)&y[i],
                     ZMULADD(_mm_load_pd((double *)&y[i]), a0,
                             _mm_load_pd((double *)&x[i])));
}

Exercise
• Optimize the following matrix multiplication program and compare its execution time with that before optimization.
#include <stdio.h>
#include <stdlib.h>   /* rand() */
#define N 1000
int main(void)
{
    static double a[N][N], b[N][N], c[N][N];
    int i, j, k;
    /* Fill the matrices with pseudo-random values */
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            a[i][j] = rand(); b[i][j] = rand(); c[i][j] = rand();
        }
    }
    /* c = c + a * b */
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            for (k = 0; k < N; k++) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    return 0;
}
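One possible starting point for the exercise, not a model answer, is loop interchange: in the i-j-k order the innermost loop walks down a column of b with stride N, whereas the i-k-j order sketched below makes all innermost accesses unit-stride. The function names and the flattened n x n layout are assumptions for illustration. The reference version is included so the two results can be compared.

```c
/* Reference: i-j-k order, as in the exercise program (c = c + a * b). */
void matmul_ijk(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Interchanged: i-k-j order. The innermost loop now reads b and updates c
 * along rows, so both are accessed with stride 1. */
void matmul_ikj(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            double aik = a[i * n + k];   /* reused across the whole j loop */
            for (int j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}
```

Blocking and unrolling from the earlier slides can then be layered on top, timing each variant against the original.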