Tensorflow 2.0 AMD support · Issue #362 · ROCmSoftwarePlatform/tensorflow-upstream, Hacker News
Hi @Cvikli, let’s step back a bit and look at your system configuration:

4x SAPPHIRE Radeon VII
2x G.SKILL FlareX 64 GB
1x Thermaltake Toughpower 1500 W Gold

A typical Gold-rated workstation power supply runs at about 87% efficiency at full load, so it can supposedly deliver up to 1307 W.
The TR 2950X TDP is rated at 180 W. The Radeon VII TDP is 300 W, but peak power consumption can reach 321.8 W (according to third-party measurements here).
Considering the other components in your workstation, the current 1500 W is not sufficient for your system at full load. We’d recommend going for an 1800 W PSU, or dual 1000 W PSUs, to provide sufficient juice for 4 Radeon VII GPUs.
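As a rough sanity check on the numbers above, the peak draw can be tallied in a few lines. The per-GPU peak and CPU TDP come from the figures in this reply; the 150 W allowance for motherboard, RAM, storage, and fans is an assumption for illustration.

```python
# Rough power-budget sketch for the 4x Radeon VII + TR 2950X build discussed above.
GPU_PEAK_W = 321.8   # per-card peak draw (third-party measurement cited above)
NUM_GPUS = 4
CPU_TDP_W = 180      # Threadripper 2950X TDP
OTHER_W = 150        # assumed allowance: motherboard, RAM, storage, fans

total = GPU_PEAK_W * NUM_GPUS + CPU_TDP_W + OTHER_W
print(f"Estimated peak draw: {total:.1f} W")  # ~1617 W, above a 1500 W PSU
```

Even before counting the rest of the system, four GPUs at peak plus the CPU already sit near the 1500 W rating, which is why extra headroom is recommended.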

2019-05-12 15:??:04.632396: E tensorflow/stream_executor/rocm/rocm_driver.cc:629] failed to allocate 14.95G (16049923584 bytes) from device: hipError_t(1002)

The above error message indicates the target GPU’s device memory has already been allocated by another process.
There are a couple of ways to expose only selected GPUs to the user process:

  1. Use the HIP_VISIBLE_DEVICES environment variable to select the target GPUs for the process at the HIP level, e.g. use the following to select the first GPU:
  • export HIP_VISIBLE_DEVICES=0
  2. Use the ROCR_VISIBLE_DEVICES environment variable to select the target GPUs at the ROCr (ROCm user-mode driver) level, e.g. the following to select the first GPU:
  • export ROCR_VISIBLE_DEVICES=0
  3. Pass selected GPU driver interfaces (/dev/dri/renderD#) to the Docker container, e.g. use the following docker run options to select the first GPU:
  • sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri/renderD128 --group-add video
    Note you should see the following four interfaces for your 4x Radeon VII system:
    $ ls /dev/dri/render*
    /dev/dri/renderD128  /dev/dri/renderD129  /dev/dri/renderD130  /dev/dri/renderD131
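For options 1 and 2, a common pattern is to launch one process per GPU, each pinned to a single device. A minimal sketch, where the python one-liner stands in for a real training script:

```shell
# Hypothetical sketch: one process per GPU, each restricted to a single
# ROCm device via ROCR_VISIBLE_DEVICES. Replace the python one-liner with
# your actual training command.
for gpu in 0 1 2 3; do
  ROCR_VISIBLE_DEVICES=$gpu \
    python3 -c 'import os; print("using GPU", os.environ["ROCR_VISIBLE_DEVICES"])' &
done
wait
```

Each child process then sees exactly one device, so concurrent jobs cannot collide on the same GPU’s memory.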

We recommend approach #3, as it isolates the GPUs at a relatively lower level of the ROCm stack.

For your concern on mGPU performance, could you provide the exact commands to reproduce your observations?

Just FYI, we have been actively running regression tests for single-node multi-GPU performance, and no mGPU performance regression has been reported for TF 1.13 on the ROCm 2.4 release.
Once you resolve the power-supply concern, for tf_cnn_benchmarks resnet50 as an example, you should be able to see near-linear scalability at FP32 using the following command with 4 GPUs:
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --data_format=NCHW --batch_size=128 --model=resnet50 --optimizer=sgd --num_batches=100 --variable_update=replicated --nodistortions --gpu_thread_mode=gpu_shared --num_gpus=4 --all_reduce_spec=pscpu --print_training_accuracy=True --display_every=10
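To quantify "near-linear" from the benchmark output, you can compare the "total images/sec" reported for 1 GPU and 4 GPUs. A small sketch (the throughput values below are placeholders, not measurements):

```python
# Sketch: scaling efficiency from tf_cnn_benchmarks "total images/sec" figures.
# Inputs are hypothetical example numbers, not measured results.
def scaling_efficiency(imgs_per_sec_1gpu: float, imgs_per_sec_ngpu: float, n: int) -> float:
    """Fraction of ideal linear speedup achieved with n GPUs (1.0 = perfect)."""
    return imgs_per_sec_ngpu / (imgs_per_sec_1gpu * n)

# e.g. 300 images/sec on 1 GPU vs. 1140 images/sec on 4 GPUs:
print(round(scaling_efficiency(300.0, 1140.0, 4), 3))  # 0.95
```

An efficiency close to 1.0 indicates the near-linear scaling described above; a markedly lower value would point to a configuration or contention issue worth reporting.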
