Running the TensorFlow GPU Image in Docker ~ Working Around the XLA_GPU_JIT Error

Ten months ago I wrote a post about an Illegal Instruction error that occurred when importing TensorFlow.

The cause was that the prebuilt TensorFlow binaries use AVX instructions, so on an older CPU without AVX support the import fails with Illegal Instruction.

One of the reasons I bought the Z620 workstation this time was that I wanted to move to a machine with a relatively recent CPU.

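As described in my previous post, this Z620 workstation drives the display from the Quadro K2000 it originally shipped with. Nothing is displayed from the GeForce GTX TITAN that I added later.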

(1) Start the GPU version of the TensorFlow image

$ sudo docker run --runtime=nvidia -it tensorflow/tensorflow:latest-gpu bash

(2) Run the sample from the official page on the command line

# python
Python 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
Note: tensorflow imported without any Illegal Instruction error.

>>> tf.enable_eager_execution()
>>> print(tf.reduce_sum(tf.random_normal([1000, 1000])))

Note: At this point the following error occurred.

2019-05-05 00:50:57.479030: I tensorflow/compiler/xla/service/platform_util.cc:194] StreamExecutor cuda device (1) is of insufficient compute capability: 3.5 required, device is 3.0
2019-05-05 00:50:57.491674: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3f995e0 executing computations on platform CUDA. Devices:
2019-05-05 00:50:57.491700: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce GTX TITAN, Compute Capability 3.5
Traceback (most recent call last):
・・・

tensorflow.python.framework.errors_impl.InvalidArgumentError: Invalid device ordinal value (1). Valid range is [0, 0].
while setting up XLA_GPU_JIT device number 1
>>>

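The log shows that device 1 (compute capability 3.0) was rejected, leaving device 0 as the only valid XLA device, and yet TensorFlow still tried to set up XLA_GPU_JIT device number 1. My unfounded guess is that, in an environment with multiple GPUs installed, the mapping between the GPU list that TensorFlow manages and the GPUs it actually uses does not line up.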
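The device numbers and compute capabilities can be read directly from the log above. The following is a minimal sketch that extracts them with regular expressions. The helper name `summarize_xla_devices` and the two patterns are my own assumptions, written to match only the two message formats seen in this particular log, not a general parser for TensorFlow output.

```python
import re

# Two log lines copied from the error output in this article.
LOG = """\
2019-05-05 00:50:57.479030: I tensorflow/compiler/xla/service/platform_util.cc:194] StreamExecutor cuda device (1) is of insufficient compute capability: 3.5 required, device is 3.0
2019-05-05 00:50:57.491700: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce GTX TITAN, Compute Capability 3.5
"""

# Pattern for a device that XLA rejected (assumed from the log format above).
REJECTED = re.compile(
    r"cuda device \((\d+)\) is of insufficient compute capability: "
    r"([\d.]+) required, device is ([\d.]+)"
)
# Pattern for a device that XLA accepted (assumed from the log format above).
ACCEPTED = re.compile(
    r"StreamExecutor device \((\d+)\): (.+), Compute Capability ([\d.]+)"
)


def summarize_xla_devices(log_text):
    """Return (accepted, rejected) dicts keyed by device ordinal.

    Hypothetical helper for illustration only.
    """
    accepted, rejected = {}, {}
    for line in log_text.splitlines():
        m = REJECTED.search(line)
        if m:
            rejected[int(m.group(1))] = {
                "required": m.group(2),
                "actual": m.group(3),
            }
            continue
        m = ACCEPTED.search(line)
        if m:
            accepted[int(m.group(1))] = {
                "name": m.group(2),
                "capability": m.group(3),
            }
    return accepted, rejected


accepted, rejected = summarize_xla_devices(LOG)
print("accepted:", accepted)
print("rejected:", rejected)
```

For this log, the only accepted device is ordinal 0 (the GTX TITAN) and ordinal 1 is rejected, which matches the "Valid range is [0, 0]" part of the error message.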

(3) Start Docker with a specific GPU

In my environment the GTX TITAN appears to be device 0, so I start Docker as follows to make it use the GTX TITAN.

$ sudo docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 -it tensorflow/tensorflow:latest-gpu bash

Then I try the sample from the command line in the same way as before.

# python

>>> import tensorflow as tf
>>> tf.enable_eager_execution()
>>> print(tf.reduce_sum(tf.random_normal([1000, 1000])))

・・・

tf.Tensor(-1604.2833, shape=(), dtype=float32)
>>>

Note: Set the environment variable NVIDIA_VISIBLE_DEVICES to the index of the GPU you want to use and pass it to the container.
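If the container is started from a script, the same command line can be assembled programmatically. This is a small sketch under my own assumptions: `docker_run_argv` is a hypothetical helper, and it joins multiple indices with commas on the assumption that the NVIDIA runtime accepts a comma-separated list. It only builds and prints the command and does not run Docker.

```python
import shlex


def docker_run_argv(gpu_ids, image="tensorflow/tensorflow:latest-gpu"):
    """Build the argv for `docker run` with only the given GPUs visible.

    Hypothetical helper; the flags are the ones used in this article.
    """
    visible = ",".join(str(i) for i in gpu_ids)
    return [
        "sudo", "docker", "run",
        "--runtime=nvidia",
        "-e", "NVIDIA_VISIBLE_DEVICES=" + visible,
        "-it", image, "bash",
    ]


argv = docker_run_argv([0])
# Print a copy-pastable shell command instead of executing it.
print(" ".join(shlex.quote(a) for a in argv))
```

With `[0]` this prints the same command used in step (3). Passing the list to `subprocess.run(argv)` would launch the container, but that needs a terminal for `-it`.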

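Incidentally, when I set the GPU index to 1 in my environment to select the Quadro K2000, the run aborted (core dumped) with the message "no supported devices found for platform CUDA". The required compute capability is 3.5 or higher, and the Quadro K2000 is only 3.0.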
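The selection rule (expose only GPUs at or above compute capability 3.5) can be written down as a small filter. This is a sketch under assumptions: the table of devices is hard-coded from the values reported in this article, the indices are those of my machine, and `usable_gpu_ids` is a hypothetical helper. The threshold of 3.5 is the one this particular image reported, so other builds may require something different.

```python
# Devices as reported in this article: index -> (name, (major, minor)).
# Hard-coded for illustration; indices will differ on other machines.
GPUS = {
    0: ("GeForce GTX TITAN", (3, 5)),
    1: ("Quadro K2000", (3, 0)),
}

# Minimum compute capability reported by the log in this article.
MIN_CAPABILITY = (3, 5)


def usable_gpu_ids(gpus, minimum=MIN_CAPABILITY):
    """Return sorted indices of GPUs whose capability meets the minimum.

    Capabilities are compared as (major, minor) tuples to avoid
    floating-point comparisons. Hypothetical helper.
    """
    return sorted(i for i, (_name, cap) in gpus.items() if cap >= minimum)


ids = usable_gpu_ids(GPUS)
visible = ",".join(str(i) for i in ids)
print("NVIDIA_VISIBLE_DEVICES=" + visible)
```

On the two cards above, only index 0 passes the check, which is why selecting index 1 alone leaves no supported device.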

Sometime soon I'd like to try the "Solutions to common ML problems" on this page.
