Raspi2 Floating Performance

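I got a Raspberry Pi 2 up and running.
On Dhrystone and similar tests, integer performance improves by roughly the clock ratio.
The Raspi1 is an ARM11@700MHz, and the Raspi2 is an ARM Cortex-A7@1000MHz.
The ARM11 was quite a good core, so it does not look bad even next to the Cortex-A7.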

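What I am curious about is floating-point performance.
It is an embedded CPU, so I should not expect much, but...
The Raspi1's FPU is ARM's VFP, which talks up "short vector" operations,
but whichever way you looked at it, it was a slow thing...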

The Raspi2's FPU is a different beast.

Using the gcc that ships with Raspbian (gcc 4.6.3) and
CFLAGS = -O4 -mfp=3 -march=armv7-a -mfpu=vfpv3-d16
I built linpack with make.

I measured with the system clock at 1000MHz and at the standard 700MHz.

--- 1000MHz ---


take@raspi2% ./linpackc_sp
Enter array size (q to quit) [200]:  
Memory required:  158K.


LINPACK benchmark, Single precision.
Machine precision:  6 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      64   0.58  86.21%  10.34%   3.45%  156952.375
     128   1.16  91.38%   0.00%   8.62%  165836.516
     256   2.32  88.79%   1.72%   9.48%  167415.844
     512   4.62  89.18%   3.46%   7.36%  164286.562
    1024   9.27  87.59%   3.02%   9.39%  167415.984
    2048  18.54  88.03%   2.10%   9.87%  168317.594

Enter array size (q to quit) [200]:  
Memory required:  158K.


LINPACK benchmark, Single precision.
Machine precision:  6 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      64   0.58  89.66%   0.00%  10.34%  169025.500
     128   1.16  83.62%   0.00%  16.38%  181223.141
     256   2.32  87.50%   4.74%   7.76%  164286.656
     512   4.63  89.20%   2.16%   8.64%  166228.547
    1024   9.27  88.57%   2.48%   8.95%  166622.547
    2048  18.54  88.24%   3.13%   8.63%  166032.109

Enter array size (q to quit) [200]:  

take@raspi2% ./linpackc_dp
Enter array size (q to quit) [200]:  
Memory required:  315K.


LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      64   0.78  87.18%   5.13%   7.69%  122074.074
     128   1.54  89.61%   4.55%   5.84%  121232.184
     256   3.09  91.59%   2.59%   5.83%  120815.578
     512   6.19  90.79%   2.75%   6.46%  121441.566
    1024  12.36  91.10%   2.91%   5.99%  121023.523

Enter array size (q to quit) [200]:  
Memory required:  315K.


LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      64   0.77  90.91%   2.60%   6.49%  122074.074
     128   1.55  91.61%   2.58%   5.81%  120401.826
     256   3.10  90.97%   2.58%   6.45%  121232.184
     512   6.20  90.97%   3.06%   5.97%  120608.348
    1024  12.40  91.05%   2.66%   6.29%  121023.523

Enter array size (q to quit) [200]:

--- 700MHz ---


take@raspi2% ./linpackc_sp
Enter array size (q to quit) [200]:  
Memory required:  158K.


LINPACK benchmark, Single precision.
Machine precision:  6 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      64   0.64  90.63%   1.56%   7.81%  148971.734
     128   1.29  88.37%   2.33%   9.30%  150245.031
     256   2.57  87.94%   2.72%   9.34%  150889.688
     512   5.14  88.13%   3.11%   8.75%  149924.641
    1024  10.28  88.42%   2.63%   8.95%  150245.156

Enter array size (q to quit) [200]:  
Memory required:  158K.


LINPACK benchmark, Single precision.
Machine precision:  6 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      64   0.65  89.23%   4.62%   6.15%  144087.297
     128   1.30  89.23%   1.54%   9.23%  148971.484
     256   2.59  88.03%   2.70%   9.27%  149605.656
     512   5.19  91.71%   2.12%   6.17%  144383.328
    1024  10.37  90.26%   2.12%   7.62%  146794.453

Enter array size (q to quit) [200]: 



take@raspi2% ./linpackc_dp
Enter array size (q to quit) [200]:  
Memory required:  315K.


LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      64   0.86  93.02%   2.33%   4.65%  107186.992
     128   1.72  92.44%   2.33%   5.23%  107844.581
     256   3.45  91.01%   2.90%   6.09%  108510.288
     512   6.87  90.83%   3.20%   5.97%  108846.233
    1024  13.75  92.95%   1.38%   5.67%  108426.626

Enter array size (q to quit) [200]:  
Memory required:  315K.


LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      64   0.86  93.02%   0.00%   6.98%  109866.667
     128   1.74  91.38%   1.72%   6.90%  108510.288
     256   3.47  89.63%   4.03%   6.34%  108176.410
     512   6.95  90.65%   3.02%   6.33%  108010.241
    1024  13.90  90.86%   2.66%   6.47%  108176.410

Enter array size (q to quit) [200]:
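
For reading the tables above: the KFLOPS column can be roughly reconstructed from the other columns. The Python sketch below rests on assumptions inferred from the numbers, not on anything the benchmark output states: that one solve costs 2n^3/3 + 2n^2 operations with n equal to half the reported array size, that each rep runs both a rolled and an unrolled pass, and that only the DGEFA + DGESL share of the time is counted. The function names are hypothetical.

```python
def linpack_ops(array_size: int) -> float:
    """Assumed operation count for one factor + solve, with n = array_size // 2."""
    n = array_size // 2
    return 2.0 * n**3 / 3.0 + 2.0 * n**2


def kflops_from_row(reps, total_s, dgefa_pct, dgesl_pct, array_size=200):
    """Rebuild the KFLOPS figure from one printed table row."""
    # Only the time spent in DGEFA and DGESL is counted, not the overhead.
    solve_s = total_s * (dgefa_pct + dgesl_pct) / 100.0
    # Factor 2: each rep is assumed to cover a rolled and an unrolled pass.
    return 2.0 * reps * linpack_ops(array_size) / (1000.0 * solve_s)


# First single-precision run at 1000MHz, 2048-rep row (printed: 168317.594).
estimate = kflops_from_row(2048, 18.54, 88.03, 2.10)
print(f"{estimate:.0f} KFLOPS")
```

Because the printed times and percentages are rounded, the estimate should only be expected to agree with the printed KFLOPS to within about 1%.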

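For a single core, this is reasonably fast for an embedded CPU.
With four cores, it might turn out fairly well if the work runs in parallel with OpenMP or the like.
Then again, cache contention between cores and limited internal bandwidth may keep it from scaling.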
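
To put rough numbers on that, here is a small Python sketch that turns the single-core results into floating-point operations per clock cycle and into an ideal four-core ceiling. It assumes the nominal clock setting is the clock the core actually ran at, and it treats perfect linear scaling as an upper bound only; the cache and bandwidth effects mentioned above would pull real results below it. The helper names are hypothetical, and the KFLOPS values are the longest rows of the first 1000MHz runs above.

```python
def flops_per_cycle(kflops: float, clock_mhz: float) -> float:
    """Floating-point operations per nominal clock cycle."""
    return (kflops * 1e3) / (clock_mhz * 1e6)


def ideal_multicore_kflops(single_core_kflops: float, cores: int = 4) -> float:
    """Upper bound assuming perfect linear scaling (not a prediction)."""
    return single_core_kflops * cores


# Single-core results at the 1000MHz setting (gcc 4.6.3 build).
single_core = {"single": 168317.594, "double": 121023.523}

for precision, kflops in single_core.items():
    per_cycle = flops_per_cycle(kflops, 1000)
    ceiling = ideal_multicore_kflops(kflops)
    print(f"{precision}: {per_cycle:.3f} FLOP/cycle, "
          f"4-core ceiling {ceiling / 1000:.0f} MFLOPS")
```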

I also ran an OpenMP version of Linpack, but I have not been able to get FLOPS figures out of it yet.

The compiler used above:


take@raspi2% gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/arm-linux-gnueabihf/4.6/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: ../src/configure -v --with-pkgversion='Debian 4.6.3-14+rpi1' --with-bugurl=file:///usr/share/doc/gcc-4.6/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.6 --enable-shared --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.6 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --enable-objc-gc --disable-sjlj-exceptions --with-arch=armv6 --with-fpu=vfp --with-float=hard --enable-checking=release --build=arm-linux-gnueabihf --host=arm-linux-gnueabihf --target=arm-linux-gnueabihf
Thread model: posix
gcc version 4.6.3 (Debian 4.6.3-14+rpi1)

2015/FEB/27 addendum

Following a comment from kkojima...
With gcc-4.8 and
CFLAGS = -O4 -march=armv7-a -mfpu=vfpv3-d16
I rebuilt linpack with make and measured again.

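This time I measured with the system clock at 1000MHz.
Single precision feels ever so slightly slower (at 700MHz it is clearly slower).
Double precision, on the other hand, seems to be faster.
Is this the power of the compiler...
Still, it is no good if single precision gets slower... hmm.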
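
One way to put numbers on this impression is to compare the 1000MHz runs of the two compilers directly. The Python sketch below takes the longest row of the first run for each case (gcc 4.6.3 from the earlier tables, gcc-4.8 from the tables that follow); picking those particular rows is my own choice, and the helper name is hypothetical.

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# KFLOPS at 1000MHz: 2048-rep row (single), 1024-rep row (double).
gcc46 = {"single": 168317.594, "double": 121023.523}
gcc48 = {"single": 167815.422, "double": 130212.346}

changes = {p: percent_change(gcc46[p], gcc48[p]) for p in gcc46}
for precision, change in changes.items():
    print(f"{precision}: {change:+.1f}%")
```

On these rows the single-precision difference comes out well under 1%, which is smaller than the spread between repeated runs of the same binary in the tables, while the double-precision gain is several percent.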


--- 1000MHz ---

take@raspi2% ./linpackc_sp
Enter array size (q to quit) [200]:  
Memory required:  158K.


LINPACK benchmark, Single precision.
Machine precision:  6 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      64   0.57  87.72%   0.00%  12.28%  175786.672
     128   1.15  88.70%   2.61%   8.70%  167415.891
     256   2.31  88.31%   4.33%   7.36%  164286.578
     512   4.60  88.91%   2.39%   8.70%  167415.844
    1024   9.20  88.70%   2.83%   8.48%  167018.266
    2048  18.41  88.21%   2.82%   8.96%  167815.422

Enter array size (q to quit) [200]:  
Memory required:  158K.


LINPACK benchmark, Single precision.
Machine precision:  6 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      64   0.58  84.48%   6.90%   8.62%  165835.672
     128   1.16  87.07%   3.45%   9.48%  167416.609
     256   2.32  88.79%   3.45%   7.76%  164287.234
     512   4.65  89.25%   2.80%   7.96%  164287.094
    1024   9.28  88.04%   3.02%   8.94%  166425.469
    2048  18.57  88.21%   3.02%   8.78%  166031.844

Enter array size (q to quit) [200]:  
Memory required:  158K.


LINPACK benchmark, Single precision.
Machine precision:  6 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      64   0.58  87.93%   3.45%   8.62%  165839.250
     128   1.17  86.33%   2.56%  11.11%  169024.250
     256   2.32  87.50%   3.45%   9.05%  166623.000
     512   4.64  88.58%   2.80%   8.62%  165836.562
    1024   9.29  89.13%   1.94%   8.93%  166228.250
    2048  18.57  88.37%   2.85%   8.78%  166032.328

Enter array size (q to quit) [200]:  q



take@raspi2% ./linpackc_dp
Enter array size (q to quit) [200]:  
Memory required:  315K.


LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      64   0.73  91.78%   1.37%   6.85%  129254.902
     128   1.46  90.41%   2.05%   7.53%  130212.346
     256   2.92  89.04%   2.40%   8.56%  131675.406
     512   5.84  90.07%   2.91%   7.02%  129492.940
    1024  11.68  89.81%   2.65%   7.53%  130212.346

Enter array size (q to quit) [200]:  
Memory required:  315K.


LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      64   0.73  89.04%   4.11%   6.85%  129254.902
     128   1.46  90.41%   2.74%   6.85%  129254.902
     256   2.92  89.73%   3.42%   6.85%  129254.902
     512   5.84  89.90%   2.74%   7.36%  129971.657
    1024  11.68  89.64%   3.00%   7.36%  129971.657

Enter array size (q to quit) [200]:  
Memory required:  315K.


LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      64   0.74  89.19%   4.05%   6.76%  127381.643
     128   1.47  91.84%   0.68%   7.48%  129254.902
     256   2.94  89.12%   3.40%   7.48%  129254.902
     512   5.89  90.15%   2.55%   7.30%  128781.441
    1024  11.78  89.81%   2.63%   7.56%  129136.211

Enter array size (q to quit) [200]:  
Memory required:  315K.


LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      64   0.74  90.54%   1.35%   8.11%  129254.902
     128   1.47  88.44%   4.08%   7.48%  129254.902
     256   2.94  88.44%   3.74%   7.82%  129731.857
     512   5.89  89.64%   2.89%   7.47%  129017.737
    1024  11.77  89.97%   2.63%   7.39%  129017.737

Enter array size (q to quit) [200]:  q


take@raspi2% gcc-4.8 -v
Using built-in specs.
COLLECT_GCC=gcc-4.8
COLLECT_LTO_WRAPPER=/usr/lib/gcc/arm-linux-gnueabihf/4.8/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: ../src/configure -v --with-pkgversion='Raspbian 4.8.2-21~rpi3rpi1' --with-bugurl=file:///usr/share/doc/gcc-4.8/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.8 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.8 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --disable-libmudflap --disable-libitm --disable-libquadmath --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.8-armhf/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.8-armhf --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.8-armhf --with-arch-directory=arm --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-sjlj-exceptions --with-arch=armv6 --with-fpu=vfp --with-float=hard --enable-checking=release --build=arm-linux-gnueabihf --host=arm-linux-gnueabihf --target=arm-linux-gnueabihf
Thread model: posix
gcc version 4.8.2 (Raspbian 4.8.2-21~rpi3rpi1)
