Raspi2 Floating Performance | たけおか ぼちぼち日記

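I got a Raspberry Pi 2 up and running.
Integer performance, measured with Dhrystone and similar benchmarks, improves roughly in proportion to the clock ratio.
The Raspi1 has an ARM11 at 700MHz; the Raspi2 has an ARM Cortex-A7 at 1000MHz.
The ARM11 was quite a good core, so it does not look much worse even next to the Cortex-A7.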

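What I am curious about is floating-point performance.
It is an embedded CPU, so one should not expect too much, but…
The Raspi1 FPU is ARM's VFP, which advertises "short vector" operation,
but whichever way you look at it, it was a slow piece of hardware…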

The Raspi2 FPU is a different thing altogether.

Using the stock Raspbian gcc (gcc 4.6.3), I built linpack with
CFLAGS = -O4 -mfp=3 -march=armv7-a -mfpu=vfpv3-d16
via make.

I measured with the system clock at 1000MHz and at the standard 700MHz.
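
As a reading aid for the KFLOPS column in the runs listed in this post, here is a minimal sketch of how that figure is probably derived. It assumes the benchmark is the widely circulated linpack.c, which (as far as I recall, so treat this as an assumption) solves an n x n system with n equal to half the entered array size, counts 2n^3/3 + 2n^2 operations per solve, and divides by the time spent in DGEFA and DGESL only. The function name `linpack_kflops` is made up for this illustration.

```python
def linpack_kflops(reps, array_size, solve_seconds):
    """Reconstruct the KFLOPS figure under the assumptions stated above.

    reps          -- value of the "Reps" column
    array_size    -- the entered array size (200 in this post)
    solve_seconds -- time spent in DGEFA + DGESL, excluding overhead
    """
    n = array_size // 2                      # assumption: n = array size / 2
    ops = (2.0 * n**3) / 3.0 + 2.0 * n**2    # operation count per solve
    # Factor 2: each repetition runs a rolled and an unrolled solve.
    return 2.0 * reps * ops / (1000.0 * solve_seconds)


# First single-precision row at 1000MHz: 64 reps, 0.58 s in total,
# of which 86.21% + 10.34% is DGEFA + DGESL (about 0.56 s).
solve_seconds = round(0.58 * (86.21 + 10.34) / 100.0, 2)
print(linpack_kflops(64, 200, solve_seconds))
```

Under these assumptions the sketch lands within rounding of the 156952.375 KFLOPS reported for that row, which suggests the assumed formula is at least consistent with the output.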



--- 1000MHz ---

take@raspi2% ./linpackc_sp
Enter array size (q to quit) [200]:
Memory required: 158K.


LINPACK benchmark, Single precision.
Machine precision: 6 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.58 86.21% 10.34% 3.45% 156952.375
128 1.16 91.38% 0.00% 8.62% 165836.516
256 2.32 88.79% 1.72% 9.48% 167415.844
512 4.62 89.18% 3.46% 7.36% 164286.562
1024 9.27 87.59% 3.02% 9.39% 167415.984
2048 18.54 88.03% 2.10% 9.87% 168317.594

Enter array size (q to quit) [200]:
Memory required: 158K.


LINPACK benchmark, Single precision.
Machine precision: 6 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.58 89.66% 0.00% 10.34% 169025.500
128 1.16 83.62% 0.00% 16.38% 181223.141
256 2.32 87.50% 4.74% 7.76% 164286.656
512 4.63 89.20% 2.16% 8.64% 166228.547
1024 9.27 88.57% 2.48% 8.95% 166622.547
2048 18.54 88.24% 3.13% 8.63% 166032.109

Enter array size (q to quit) [200]:

take@raspi2% ./linpackc_dp
Enter array size (q to quit) [200]:
Memory required: 315K.


LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.78 87.18% 5.13% 7.69% 122074.074
128 1.54 89.61% 4.55% 5.84% 121232.184
256 3.09 91.59% 2.59% 5.83% 120815.578
512 6.19 90.79% 2.75% 6.46% 121441.566
1024 12.36 91.10% 2.91% 5.99% 121023.523

Enter array size (q to quit) [200]:
Memory required: 315K.


LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.77 90.91% 2.60% 6.49% 122074.074
128 1.55 91.61% 2.58% 5.81% 120401.826
256 3.10 90.97% 2.58% 6.45% 121232.184
512 6.20 90.97% 3.06% 5.97% 120608.348
1024 12.40 91.05% 2.66% 6.29% 121023.523

Enter array size (q to quit) [200]:







--- 700MHz ---

take@raspi2% ./linpackc_sp
Enter array size (q to quit) [200]:
Memory required: 158K.


LINPACK benchmark, Single precision.
Machine precision: 6 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.64 90.63% 1.56% 7.81% 148971.734
128 1.29 88.37% 2.33% 9.30% 150245.031
256 2.57 87.94% 2.72% 9.34% 150889.688
512 5.14 88.13% 3.11% 8.75% 149924.641
1024 10.28 88.42% 2.63% 8.95% 150245.156

Enter array size (q to quit) [200]:
Memory required: 158K.


LINPACK benchmark, Single precision.
Machine precision: 6 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.65 89.23% 4.62% 6.15% 144087.297
128 1.30 89.23% 1.54% 9.23% 148971.484
256 2.59 88.03% 2.70% 9.27% 149605.656
512 5.19 91.71% 2.12% 6.17% 144383.328
1024 10.37 90.26% 2.12% 7.62% 146794.453

Enter array size (q to quit) [200]:



take@raspi2% ./linpackc_dp
Enter array size (q to quit) [200]:
Memory required: 315K.


LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.86 93.02% 2.33% 4.65% 107186.992
128 1.72 92.44% 2.33% 5.23% 107844.581
256 3.45 91.01% 2.90% 6.09% 108510.288
512 6.87 90.83% 3.20% 5.97% 108846.233
1024 13.75 92.95% 1.38% 5.67% 108426.626

Enter array size (q to quit) [200]:
Memory required: 315K.


LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.86 93.02% 0.00% 6.98% 109866.667
128 1.74 91.38% 1.72% 6.90% 108510.288
256 3.47 89.63% 4.03% 6.34% 108176.410
512 6.95 90.65% 3.02% 6.33% 108010.241
1024 13.90 90.86% 2.66% 6.47% 108176.410

Enter array size (q to quit) [200]:





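For a single core, this is reasonably fast by embedded-CPU standards.
There are four cores, so if the work runs in parallel with OpenMP or similar, the result could be fairly good.
Then again, cache contention between cores or limited internal bandwidth might keep the performance from showing up.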
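
To put the single-core figures above side by side, the following sketch averages the KFLOPS value from the last (longest) row of each gcc 4.6.3 run and compares the 1000MHz and 700MHz results. Picking only the last row is my own simplification, not something the benchmark prescribes, and the variable names are illustrative.

```python
from statistics import mean

# KFLOPS from the last row of each gcc 4.6.3 run quoted in this post.
# Keys are (precision, clock in MHz).
kflops = {
    ("sp", 1000): [168317.594, 166032.109],
    ("dp", 1000): [121023.523, 121023.523],
    ("sp", 700): [150245.156, 146794.453],
    ("dp", 700): [108426.626, 108176.410],
}

avg = {key: mean(values) for key, values in kflops.items()}
clock_ratio = 1000 / 700

# Speedup of the 1000MHz runs over the 700MHz runs, per precision.
speedup = {p: avg[(p, 1000)] / avg[(p, 700)] for p in ("sp", "dp")}

for p in ("sp", "dp"):
    print(f"{p}: {avg[(p, 1000)] / 1000:.1f} MFLOPS at 1000MHz, "
          f"{speedup[p]:.2f}x over 700MHz (clock ratio {clock_ratio:.2f}x)")
```

On these numbers the 1000MHz runs come out roughly 12 percent faster than the 700MHz runs, which is noticeably less than the 1.43x clock ratio.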

I also ran an OpenMP version of Linpack, but so far I have not been able to get a FLOPS figure out of it.



The compiler used for the runs above:

take@raspi2% gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/arm-linux-gnueabihf/4.6/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: ../src/configure -v --with-pkgversion='Debian 4.6.3-14+rpi1' --with-bugurl=file:///usr/share/doc/gcc-4.6/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.6 --enable-shared --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.6 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --enable-objc-gc --disable-sjlj-exceptions --with-arch=armv6 --with-fpu=vfp --with-float=hard --enable-checking=release --build=arm-linux-gnueabihf --host=arm-linux-gnueabihf --target=arm-linux-gnueabihf
Thread model: posix
gcc version 4.6.3 (Debian 4.6.3-14+rpi1)






Addendum, 2015/FEB/27

Prompted by a comment from kkojima…
Using gcc-4.8, I rebuilt linpack with
CFLAGS = -O4 -march=armv7-a -mfpu=vfpv3-d16
via make and measured again.

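I measured with the system clock at 1000MHz.
Single precision feels ever so slightly slower (at 700MHz it is clearly slower).
Double precision, on the other hand, appears to be faster.
Is this the power of the compiler…
But it is no good if single precision gets slower… hmm.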

--- 1000MHz ---

take@raspi2% ./linpackc_sp
Enter array size (q to quit) [200]:
Memory required: 158K.


LINPACK benchmark, Single precision.
Machine precision: 6 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.57 87.72% 0.00% 12.28% 175786.672
128 1.15 88.70% 2.61% 8.70% 167415.891
256 2.31 88.31% 4.33% 7.36% 164286.578
512 4.60 88.91% 2.39% 8.70% 167415.844
1024 9.20 88.70% 2.83% 8.48% 167018.266
2048 18.41 88.21% 2.82% 8.96% 167815.422

Enter array size (q to quit) [200]:
Memory required: 158K.


LINPACK benchmark, Single precision.
Machine precision: 6 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.58 84.48% 6.90% 8.62% 165835.672
128 1.16 87.07% 3.45% 9.48% 167416.609
256 2.32 88.79% 3.45% 7.76% 164287.234
512 4.65 89.25% 2.80% 7.96% 164287.094
1024 9.28 88.04% 3.02% 8.94% 166425.469
2048 18.57 88.21% 3.02% 8.78% 166031.844

Enter array size (q to quit) [200]:
Memory required: 158K.


LINPACK benchmark, Single precision.
Machine precision: 6 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.58 87.93% 3.45% 8.62% 165839.250
128 1.17 86.33% 2.56% 11.11% 169024.250
256 2.32 87.50% 3.45% 9.05% 166623.000
512 4.64 88.58% 2.80% 8.62% 165836.562
1024 9.29 89.13% 1.94% 8.93% 166228.250
2048 18.57 88.37% 2.85% 8.78% 166032.328

Enter array size (q to quit) [200]: q



take@raspi2% ./linpackc_dp
Enter array size (q to quit) [200]:
Memory required: 315K.


LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.73 91.78% 1.37% 6.85% 129254.902
128 1.46 90.41% 2.05% 7.53% 130212.346
256 2.92 89.04% 2.40% 8.56% 131675.406
512 5.84 90.07% 2.91% 7.02% 129492.940
1024 11.68 89.81% 2.65% 7.53% 130212.346

Enter array size (q to quit) [200]:
Memory required: 315K.


LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.73 89.04% 4.11% 6.85% 129254.902
128 1.46 90.41% 2.74% 6.85% 129254.902
256 2.92 89.73% 3.42% 6.85% 129254.902
512 5.84 89.90% 2.74% 7.36% 129971.657
1024 11.68 89.64% 3.00% 7.36% 129971.657

Enter array size (q to quit) [200]:
Memory required: 315K.


LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.74 89.19% 4.05% 6.76% 127381.643
128 1.47 91.84% 0.68% 7.48% 129254.902
256 2.94 89.12% 3.40% 7.48% 129254.902
512 5.89 90.15% 2.55% 7.30% 128781.441
1024 11.78 89.81% 2.63% 7.56% 129136.211

Enter array size (q to quit) [200]:
Memory required: 315K.


LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.74 90.54% 1.35% 8.11% 129254.902
128 1.47 88.44% 4.08% 7.48% 129254.902
256 2.94 88.44% 3.74% 7.82% 129731.857
512 5.89 89.64% 2.89% 7.47% 129017.737
1024 11.77 89.97% 2.63% 7.39% 129017.737

Enter array size (q to quit) [200]: q
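
As a rough check of the impression stated in the addendum, this sketch compares the 1000MHz results of the two compilers. It again takes only the last (longest) row of each run, which is my own simplification; the function name `relative_change` is illustrative.

```python
from statistics import mean

# KFLOPS from the last row of each 1000MHz run quoted in this post.
sp = {
    "gcc-4.6": [168317.594, 166032.109],
    "gcc-4.8": [167815.422, 166031.844, 166032.328],
}
dp = {
    "gcc-4.6": [121023.523, 121023.523],
    "gcc-4.8": [130212.346, 129971.657, 129136.211, 129017.737],
}


def relative_change(runs):
    """Percent change of the gcc-4.8 mean relative to the gcc-4.6 mean."""
    old = mean(runs["gcc-4.6"])
    new = mean(runs["gcc-4.8"])
    return (new - old) / old * 100.0


sp_change = relative_change(sp)
dp_change = relative_change(dp)
print(f"single precision: {sp_change:+.1f}%")
print(f"double precision: {dp_change:+.1f}%")
```

With these rows, single precision differs by well under one percent, which is within the run-to-run scatter visible in the tables, while double precision gains about seven percent with gcc-4.8.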


take@raspi2% gcc-4.8 -v
Using built-in specs.
COLLECT_GCC=gcc-4.8
COLLECT_LTO_WRAPPER=/usr/lib/gcc/arm-linux-gnueabihf/4.8/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: ../src/configure -v --with-pkgversion='Raspbian 4.8.2-21~rpi3rpi1' --with-bugurl=file:///usr/share/doc/gcc-4.8/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.8 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.8 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --disable-libmudflap --disable-libitm --disable-libquadmath --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.8-armhf/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.8-armhf --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.8-armhf --with-arch-directory=arm --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-sjlj-exceptions --with-arch=armv6 --with-fpu=vfp --with-float=hard --enable-checking=release --build=arm-linux-gnueabihf --host=arm-linux-gnueabihf --target=arm-linux-gnueabihf
Thread model: posix
gcc version 4.8.2 (Raspbian 4.8.2-21~rpi3rpi1)