University of Tsukuba, Center for Computational Sciences (CCS) HPC Summer Seminar: "Optimization I"
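Daisuke Takahashi ([email protected])
Graduate School of Systems and Information Engineering, University of Tsukuba
Center for Computational Sciences

Lecture outline
• Techniques for optimizing a program on a single compute node of a computer system
  – Register blocking
  – Cache blocking
  – Using streaming SIMD instructions

Performance tuning
• Everyone recognizes how important performance is for software applications.
• Yet performance tuning tends to be put off until late in the software development cycle, and sometimes it is not considered at all.
• Factors behind this situation include:
  – the belief that code generation tools and compilers alone can optimize an application;
  – the excessive expectation that simply using a newer processor will give the best performance when the application runs.

Why performance tuning matters
• However, there are cases, such as computations that take a very long time to run, where optimization can cut the execution time substantially.
• A program that is used by many people, such as a numerical library, is well worth tuning.
• If tuning improves performance by 30%, that is equivalent to using a machine with 30% higher performance.

Optimization
• Optimization can target many things:
  – reducing code size
  – reducing data size
  – reducing execution time
• In this lecture, "optimization" means rewriting a program in order to reduce its execution time.

Benefits of optimization
• Reducing the execution time through optimization gives:
  – effective use of computing resources
  – lower power consumption (or lower usage charges)
  – more computation in the same amount of time
• Counting both the time spent writing a program and the time spent running it, the longer a program runs, the more it benefits from optimization.
  – If optimization improves performance by 30%, that is equivalent to using a machine with 30% higher performance.
• Optimizing a program that is run only once and finishes quickly has little value.

Before optimizing
• Is optimization needed in the first place?
• Is the algorithm currently in use the best one?
• Optimizing an inefficient algorithm is pointless.
  – An optimized bubble sort will still not be faster than quicksort.
• The best algorithm depends heavily on:
  – the nature of the problem to be solved
  – the architecture, memory capacity, and so on of the computer to be used

Optimization guidelines
• If the vendor provides a fast library, use it whenever possible.
  – BLAS, LAPACK, and so on
• The optimization capability of recent compilers has become very high.
• Do not do by hand an optimization that the compiler can do.
  – It only costs time.
  – The program becomes more complicated, which leaves room for bugs.
  – Do not overestimate the compiler's optimization capability, either.
• Devote the time to improving the algorithm.
• Do not use assembly language unless it is unavoidable.

Approach to optimization
• First, find out how much arithmetic performance your own program currently achieves.
• FLOPS (Floating-point Operations Per Second) is a common measure of arithmetic performance.
  – It is the number of floating-point operations that can be executed in one second.
  – MFLOPS (10^6), GFLOPS (10^9), TFLOPS (10^12)
• Compute the FLOPS value from the execution time of the whole program (or of a part of it) and its operation count, then compare it with the theoretical peak performance of the processor.
  – On a Pentium4, the peak FLOPS value is 2 times the clock frequency.
  – On an Intel Core2, Core i7, or AMD Quad-Core Opteron, the peak FLOPS value is 4 times the clock frequency (per core).

Time measurement
• Two kinds of time can be measured:
  – elapsed time
  – CPU time
• If the program being measured runs only briefly, the timer resolution may be insufficient.
  – Wrap it in an outer loop that repeats it several times and measure the total.
• In that case, note that compiler optimization may leave the loop not actually executing.
  – Insert a dummy routine, or turn the measured code into a subroutine and compile it separately.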
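The repeat-and-measure idea from the time measurement notes can be sketched as follows. This is a minimal illustration, not code from the lecture: the names `timing_result` and `time_daxpy`, the DAXPY workload, and the choice of C11 `timespec_get` for elapsed time and `clock()` for CPU time are all assumptions. Summing the result after the timed region is one simple way to give the compiler a reason to keep the loop; compiling the measured routine separately, as the notes suggest, is the more robust option.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical result record for this sketch. */
typedef struct {
    double elapsed;   /* wall-clock seconds for all repeats */
    double cpu;       /* CPU seconds for all repeats */
    double mflops;    /* 2*n*repeats floating-point operations / elapsed / 1e6 */
    double checksum;  /* sum of y, used so the work cannot be discarded */
} timing_result;

/* Workload: y = y + a * x, 2 floating-point operations per element. */
static void daxpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Wall-clock time in seconds (C11 timespec_get). */
static double elapsed_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}

/* CPU time in seconds. */
static double cpu_seconds(void)
{
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

/* Repeat a short kernel `repeats` times so the total exceeds the timer resolution. */
timing_result time_daxpy(size_t n, int repeats)
{
    assert(n > 0 && repeats > 0);
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++) {
        x[i] = 1.0;
        y[i] = 0.0;
    }

    double e0 = elapsed_seconds();
    double c0 = cpu_seconds();
    for (int r = 0; r < repeats; r++)
        daxpy(n, 0.5, x, y);
    double e1 = elapsed_seconds();
    double c1 = cpu_seconds();

    timing_result t;
    t.elapsed = e1 - e0;
    t.cpu = c1 - c0;
    t.checksum = 0.0;
    for (size_t i = 0; i < n; i++)
        t.checksum += y[i];
    t.mflops = (t.elapsed > 0.0)
                   ? 2.0 * (double)n * (double)repeats / t.elapsed * 1.0e-6
                   : 0.0;
    free(x);
    free(y);
    return t;
}
```

Calling `time_daxpy(1000000, 100)` and printing `elapsed`, `cpu`, and `mflops` gives a figure that can be compared with the processor's theoretical peak. If `elapsed` comes back as zero, increase `repeats`.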
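Hot spots
• The part of a program that accounts for most of the computation time is called a "hot spot".
• First, find out where the hot spots are.
• A profiler is a useful tool for this.
  – On Linux, the gprof command is available.
    • Adding the "-pg" compiler option, as in "gcc -pg foo.c", generates extra code that writes the profile information used by gprof.
  – Run a.out, then run "gprof a.out" to identify the hot spots.

Example gprof output

    Flat profile:

    Each sample counts as 0.01 seconds.
      %   cumulative   self               self     total
     time   seconds   seconds    calls   s/call   s/call  name
     48.90      2.90     2.90        2     1.45     2.83  zfft1d0_
     32.38      4.82     1.92    49152     0.00     0.00  fft8b_
     14.17      5.66     0.84    16384     0.00     0.00  fft8a_
      4.55      5.93     0.27        1     0.27     5.93  MAIN__
      0.00      5.93     0.00    16384     0.00     0.00  fft235_
      0.00      5.93     0.00        4     0.00     0.00  factor_
      0.00      5.93     0.00        3     0.00     1.89  zfft1d_
      0.00      5.93     0.00        2     0.00     0.00  settbl_
      0.00      5.93     0.00        1     0.00     0.00  settbls_

What the gprof result tells us
• There are three hot spots:
  – zfft1d0_
  – fft8b_
  – fft8a_
  Together they account for 95% of the total execution time.
• It is enough to concentrate the optimization effort on these hot spots.
• When writing a program, take care that the hot spots are concentrated in a few places.
• If there are many hot spots, improving the code takes time.
  – Sometimes it is better to rewrite the code from scratch.

Compile options
• Performance changes greatly depending on how the compile options are specified.
• Consult the compiler manual and try various compile options.
  – "-fast", "-O3", "-O2", and so on
  – With the Intel Compiler, "-xS" (for Intel Core2)
• Raising the optimization level does not necessarily produce faster code.
  – The compiler may apply optimizations that do more harm than good.
  – Note that the computed results may also stop agreeing.

Compiler directives
• A compiler directive (directive line) conveys the programmer's intent to the compiler and helps it optimize.
  – Unlike compile options, directives can control optimization loop by loop.
• Examples of directives
  – When vectorizing, telling the compiler that a loop has no dependences.
  – Suppressing vectorization.
• Directives are usually written with "#pragma" in C and with "!dir$", "cpgi$l", and so on in Fortran.
  (Note that they can differ from compiler to compiler.)

ZAXPY written in Fortran

          subroutine zaxpy(n,a,x,y)
          complex*16 a,x(*),y(*)
    !dir$ vector aligned
          do i=1,n
            y(i)=y(i)+a*x(i)
          end do
          return
          end

Points to note when writing programs
• Follow the C or Fortran language rules strictly.
  – Some compilers only emit a warning, but in many cases violations become the cause of bugs.
• Avoid compiler-dependent extensions as far as possible, except where unavoidable (for example, directives).
  – Automatic arrays in g77
    • A declaration such as real*8 a(n) where a(n) is not an argument and n is a variable
  – Extensions make the program less portable.
  – They can cause unexpected errors.
• Avoid functions and features that are (or seem to be) rarely used.
  – Compiler bugs in them may not have been fully removed.

Memory hierarchy (1/4)
[Figure: memory hierarchy diagram. Source: Wikipedia]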
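Memory hierarchy (2/4)
• Cost-performance of storage devices that hold data
  – "Small capacity, fast" versus "large capacity, slow" is the rule.
• "Small capacity, fast" storage
  – Registers
• "Large capacity, slow" storage
  – Hard disks and magnetic tape
• "Large capacity, fast" storage has such poor cost-performance that it is hard to realize.

Memory hierarchy (3/4)
• The memory hierarchy is designed on the premise that access patterns to storage exhibit locality.
• There are two kinds of locality:
  – Temporal locality
    • An access to a given address tends to recur within a relatively short time.
  – Spatial locality
    • Data accessed within a given period of time tends to be located at relatively nearby addresses.

Memory hierarchy (4/4)
• These tendencies often hold for non-numerical computation such as business data processing, but they are not typical of numerical programs.
• In large-scale scientific computation in particular, data references often have no temporal locality.
• This is a major reason why vector supercomputers had an advantage in scientific computation.

BLAS performance (Woodcrest 2.4 GHz, 4 MB L2 cache, Intel MKL 9.1)
[Figure: two performance plots; the plotted data did not survive in the transcript]

BLAS operation counts

    BLAS                              Loads + stores   Floating-point ops   Ratio (n = m = k)
    Level 1  DAXPY  y = y + αx        3n               2n                   3:2
    Level 2  DGEMV  y = βy + αAx      mn + n + 2m      2mn                  1:2
    Level 3  DGEMM  C = βC + αAB      2mn + mk + kn    2mnk                 2:n

The concept of Byte/Flop (1/2)
• The amount of memory access needed per floating-point operation can be expressed in Byte/Flop.

          subroutine daxpy(n, a, x, y)
          real*8 a, x(*), y(*)
          do i = 1, n
            y(i) = y(i) + a * x(i)
          end do
          return
          end

• In daxpy, each iteration performs 2 double-precision floating-point operations and needs 3 loads/stores of double-precision data (24 bytes in total).
  – This gives 24 Byte / 2 Flop = 12 Byte/Flop.
• The smaller the Byte/Flop value a code requires, the better.

The concept of Byte/Flop (2/2)
• On a dual-core Intel Core2 processor (3 GHz):
  – The theoretical peak performance is 12 Gflops x 2 cores = 24 Gflops.
  – The memory bandwidth is at most 21 GB/s (with 4 channels).
  – The Byte/Flop value the machine can supply is 21/24 = 0.875.
• In daxpy, once the working set exceeds the cache capacity, the memory bandwidth (21 GB/s) becomes the bottleneck, so no more than 21/12 = 1.75 Gflops can be achieved.
  – That is only about 7% of the theoretical peak performance!
• Memory bandwidth therefore determines the effective performance.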
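The Byte/Flop arithmetic above can be reproduced with a few lines of C. This is only a sketch of the calculation in the notes; the function names are hypothetical, and the model assumes that performance is limited either by memory bandwidth or by the peak rate, whichever is lower.

```c
#include <assert.h>

/* Byte/Flop demanded by a kernel: bytes moved per iteration / flops per iteration. */
double kernel_bytes_per_flop(double bytes_per_iter, double flops_per_iter)
{
    assert(flops_per_iter > 0.0);
    return bytes_per_iter / flops_per_iter;
}

/* Byte/Flop supplied by a machine: memory bandwidth (GB/s) / peak rate (Gflops). */
double machine_bytes_per_flop(double bandwidth_gbs, double peak_gflops)
{
    assert(peak_gflops > 0.0);
    return bandwidth_gbs / peak_gflops;
}

/* Upper bound on performance (Gflops) when the working set does not fit in cache:
   bandwidth / (kernel Byte/Flop), capped at the theoretical peak. */
double attainable_gflops(double bandwidth_gbs, double peak_gflops,
                         double kernel_bpf)
{
    assert(kernel_bpf > 0.0);
    double bound = bandwidth_gbs / kernel_bpf;
    return (bound < peak_gflops) ? bound : peak_gflops;
}
```

With the figures from the notes, `kernel_bytes_per_flop(24.0, 2.0)` is 12, `machine_bytes_per_flop(21.0, 24.0)` is 0.875, and `attainable_gflops(21.0, 24.0, 12.0)` is 1.75, which is about 7% of the 24 Gflops peak.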
Loop unrolling (1/2)
• Loop unrolling expands the body of a loop in order to:
  – reduce the loop overhead
  – perform register blocking
• Unrolling too far causes register shortage and instruction cache misses, so care is needed.

    /* Original loop */
    double A[N], B[N], C;
    for (i = 0; i < N; i++) {
        A[i] += B[i] * C;
    }

    /* Unrolled by 4 (assumes N is a multiple of 4) */
    double A[N], B[N], C;
    for (i = 0; i < N; i += 4) {
        A[i]   += B[i]   * C;
        A[i+1] += B[i+1] * C;
        A[i+2] += B[i+2] * C;
        A[i+3] += B[i+3] * C;
    }

Loop unrolling (2/2)

    /* Matrix multiplication example */
    double A[N][N], B[N][N], C[N][N], s;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            s = 0.0;
            for (k = 0; k < N; k++) {
                s += A[i][k] * B[k][j];
            }
            C[i][j] = s;
        }
    }

    /* Optimized matrix multiplication: the i loop is unrolled by 2
       (assumes N is even), so each B[k][j] is reused for two rows */
    double A[N][N], B[N][N], C[N][N], s0, s1;
    for (i = 0; i < N; i += 2) {
        for (j = 0; j < N; j++) {
            s0 = 0.0;
            s1 = 0.0;
            for (k = 0; k < N; k++) {
                s0 += A[i][k]   * B[k][j];
                s1 += A[i+1][k] * B[k][j];
            }
            C[i][j]   = s0;
            C[i+1][j] = s1;
        }
    }

Loop interchange
• Loop interchange is a technique that mainly reduces the bad effects of memory references with a large stride.
• The compiler sometimes decides to interchange loops by itself.

    /* Before loop interchange: the inner loop walks down a column (stride N) */
    double A[N][N], B[N][N], C;
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            A[k][j] += B[k][j] * C;
        }
    }

    /* After loop interchange: the inner loop walks along a row (stride 1) */
    double A[N][N], B[N][N], C;
    for (k = 0; k < N; k++) {
        for (j = 0; j < N; j++) {
            A[k][j] += B[k][j] * C;
        }
    }

Padding
• Effective when several arrays are mapped to the same cache locations and thrashing occurs.
  – Especially for arrays whose size is a power of 2
• Try changing the declared size of the two-dimensional array slightly.
• Some compilers do this for you when a compile option is specified.

    /* Before padding */
    double A[N][N], B[N][N];
    for (k = 0; k < N; k++) {
        for (j = 0; j < N; j++) {
            A[j][k] = B[k][j];
        }
    }

    /* After padding: each row has one extra element */
    double A[N][N+1], B[N][N+1];
    for (k = 0; k < N; k++) {
        for (j = 0; j < N; j++) {
            A[j][k] = B[k][j];
        }
    }

Blocking (1/2)
• An effective technique for optimizing memory references.
• Reduce cache misses as much as possible.

    /* Without blocking */
    double A[N][N], B[N][N], C;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            A[i][j] += B[j][i] * C;
        }
    }

    /* With 4 x 4 blocking (assumes N is a multiple of 4) */
    double A[N][N], B[N][N], C;
    for (i = 0; i < N; i += 4) {
        for (j = 0; j < N; j += 4) {
            for (ii = i; ii < i + 4; ii++) {
                for (jj = j; jj < j + 4; jj++) {
                    A[ii][jj] += B[jj][ii] * C;
                }
            }
        }
    }

Blocking (2/2)
[Figure: memory access pattern without blocking, and memory access pattern with blocking]
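The blocked loop in the notes uses a fixed block size of 4 and a matrix size that is a multiple of it. A sketch of the same technique for an arbitrary size and block size is given below. It is an illustration under assumptions, not the lecture's code: the function names are hypothetical, and the matrices are stored as flat row-major arrays so that the size can be a run-time parameter.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: A[i][j] += B[j][i] * c for n x n row-major matrices. */
void add_transposed_naive(size_t n, double *a, const double *b, double c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            a[i * n + j] += b[j * n + i] * c;
}

/* Blocked version: the index space is traversed in bs x bs tiles so that the
   rows of A and the columns of B touched by one tile stay in cache.
   The tile bounds are clipped so that n need not be a multiple of bs. */
void add_transposed_blocked(size_t n, size_t bs, double *a, const double *b,
                            double c)
{
    assert(bs > 0);
    for (size_t i = 0; i < n; i += bs) {
        size_t i_end = (i + bs < n) ? i + bs : n;
        for (size_t j = 0; j < n; j += bs) {
            size_t j_end = (j + bs < n) ? j + bs : n;
            for (size_t ii = i; ii < i_end; ii++)
                for (size_t jj = j; jj < j_end; jj++)
                    a[ii * n + jj] += b[jj * n + ii] * c;
        }
    }
}
```

Both functions update every element exactly once, so they produce identical results; only the order of the memory accesses differs. A suitable block size depends on the cache size and is best found by measurement.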
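Using streaming SIMD instructions
• To process floating-point operations faster, many recent processors provide what are called streaming SIMD instructions.
  – SSE/SSE2/SSE3/SSSE3/SSE4 on Intel Pentium4/Xeon
  – 3DNow! on AMD Athlon
  – AltiVec on Motorola PowerPC
• On the Intel Core2, using SSE3 instructions can raise floating-point performance by a factor of 4.

Intel SSE3 instructions
• SSE2 instructions are arithmetic instructions introduced with the Intel Pentium4/Xeon to replace the x87 instructions; SSE3 adds 13 new instructions to SSE2.
  – SIMD processing can be applied to 128-bit data.
  – Intel Pentium4 and Xeon processors have eight 128-bit XMM registers, XMM0 to XMM7.
  – AMD Opteron processors and Xeon EM64T processors have sixteen, XMM0 to XMM15.
• With the SSE3 vector instructions, 64-bit double-precision floating-point data can be processed by vector operations of vector length 2 (add, subtract, multiply, divide, square root, and logical operations).

How to use SSE3 instructions
• SSE3 instructions can be used in the following ways:
  (1) Let the compiler vectorize the code.
  (2) Use the SSE3 intrinsic functions.
  (3) Use inline assembly.
  (4) Write a ".s" file directly in assembly language.
• Coding becomes more complicated in the order (1) to (4), but more advantageous in terms of performance.

Double-precision complex multiply-add (a + b * c) written with SSE3 intrinsics

    #include <pmmintrin.h>  /* header file needed to use SSE3 instructions */

    static __inline __m128d ZMULADD(__m128d a, __m128d b, __m128d c)
    {
        __m128d br, bi;                /* variables of the 128-bit data type */

        br = _mm_movedup_pd(b);        /* br = [b.r b.r]  take only the real part */
        br = _mm_mul_pd(br, c);        /* br = [b.r*c.r b.r*c.i] */
        a = _mm_add_pd(a, br);         /* a = [a.r+b.r*c.r a.i+b.r*c.i] */
        bi = _mm_unpackhi_pd(b, b);    /* bi = [b.i b.i]  take only the imaginary part */
        c = _mm_shuffle_pd(c, c, 1);   /* c = [c.i c.r]  swap real and imaginary parts */
        bi = _mm_mul_pd(bi, c);        /* bi = [b.i*c.i b.i*c.r] */

        /* addsub subtracts in the low element and adds in the high element:
           [a.r+b.r*c.r-b.i*c.i a.i+b.r*c.i+b.i*c.r] */
        return _mm_addsub_pd(a, bi);
    }

ZAXPY written in C

    typedef struct { double r, i; } doublecomplex;

    void zaxpy(int n, doublecomplex a, doublecomplex *x, doublecomplex *y)
    {
        int i;

        if (a.r == 0.0 && a.i == 0.0) return;

    #pragma unroll(8)
    #pragma vector aligned
        for (i = 0; i < n; i++) {
            y[i].r += a.r * x[i].r - a.i * x[i].i;
            y[i].i += a.r * x[i].i + a.i * x[i].r;
        }
    }

ZAXPY using SSE3 intrinsics

    #include <pmmintrin.h>

    typedef struct { double r, i; } doublecomplex;

    __m128d ZMULADD(__m128d a, __m128d b, __m128d c);

    /* x and y must be 16-byte aligned because _mm_load_pd and _mm_store_pd
       require aligned addresses */
    void zaxpy(int n, doublecomplex a, doublecomplex *x, doublecomplex *y)
    {
        int i;
        __m128d a0;

        if (a.r == 0.0 && a.i == 0.0) return;

        a0 = _mm_loadu_pd((double *)&a);
    #pragma unroll(8)
        for (i = 0; i < n; i++)
            _mm_store_pd((double *)&y[i],
                         ZMULADD(_mm_load_pd((double *)&y[i]), a0,
                                 _mm_load_pd((double *)&x[i])));
    }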
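To follow the element-by-element comments in ZMULADD, it can help to write the same arithmetic in plain scalar C, one statement per intrinsic step. The sketch below is a portable reference only, not a replacement for the SSE3 version; `zmuladd_scalar` and `zaxpy_scalar` are hypothetical names introduced for this illustration.

```c
#include <assert.h>

typedef struct { double r, i; } doublecomplex;

/* Scalar reference for a + b * c, mirroring the steps of ZMULADD. */
doublecomplex zmuladd_scalar(doublecomplex a, doublecomplex b, doublecomplex c)
{
    /* movedup + mul + add: a = [a.r + b.r*c.r, a.i + b.r*c.i] */
    double t_r = a.r + b.r * c.r;
    double t_i = a.i + b.r * c.i;

    /* unpackhi + shuffle + mul: bi = [b.i*c.i, b.i*c.r] */
    double u_r = b.i * c.i;
    double u_i = b.i * c.r;

    /* addsub: subtract in the real (low) element, add in the imaginary (high) one */
    doublecomplex out;
    out.r = t_r - u_r;
    out.i = t_i + u_i;
    return out;
}

/* Scalar ZAXPY built on the reference multiply-add: y = y + a * x. */
void zaxpy_scalar(int n, doublecomplex a, const doublecomplex *x,
                  doublecomplex *y)
{
    assert(n >= 0);
    if (a.r == 0.0 && a.i == 0.0) return;
    for (int k = 0; k < n; k++)
        y[k] = zmuladd_scalar(y[k], a, x[k]);
}
```

For example, with a = 1 + 2i, b = 3 + 4i and c = 5 + 6i, the product b * c is -9 + 38i, so `zmuladd_scalar` returns -8 + 40i. Comparing the output of the intrinsic version against this reference on a few inputs is a cheap way to check that the shuffles are in the right order.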
" 2 3 !& , + 1 & % / , - & + ' / ( 40 õġ •! ŇÌĹä1ÐŝäīƉƛŪƗƐŝĉ?ķƦ ĉ?Ĺř7ńʼndäĒŝ®āĻŖƧ #include <stdio.h> #define N 1000 int main(void) { static double a[N][N], b[N][N], c[N][N]; int i, j, k; for (i = 0; i < N; i++) { for (j = 0; j < N; j++) { a[i][j] = rand(); b[i][j] = rand(); c[i][j] = rand(); } } for (i = 0; i < N; i++) { for (j = 0; j < N; j++) { for (k = 0; k < N; k++) { c[i][j] += a[i][k] * b[k][j]; } } } return 0; } 41