I took a look at ChainerMN performance in the following environment (CentOS 7.5 + CUDA 9.2 + cuDNN 7.1.4 + chainermn 1.3.0 + chainer 4.1.0 + cupy 4.1.0).
$ time python ./train_mnist.py -e 50 -g 0
GPU: 0
# unit: 1000
# Minibatch-size: 100
# epoch: 50
epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time
1 0.190873 0.0904841 0.942334 0.971 5.10388
............
50 0.0038919 0.145653 0.999116 0.9845 163.166
real 2m46.186s
user 2m52.792s
sys 0m33.726s
$ time python ./train_mnist_data_parallel.py -e 50
GPU: 0, 1
# unit: 1000
# Minibatch-size: 400
# epoch: 50
epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time
1 0.284769 0.109672 0.915433 0.9667 3.43092
.................
50 4.96368e-06 0.0928816 1 0.9865 78.3696
real 1m21.280s
user 1m19.968s
sys 0m2.447s
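
To put a number on the difference between the two runs above, here is a minimal sketch that converts the `real` times and the final `elapsed_time` values from the logs into speedup ratios. The helper `to_seconds` and the variable names are hypothetical and exist only for this illustration; the numbers are copied from the logs as-is.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a `time`-style stamp such as '2m46.186s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time stamp: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Values copied from the two runs above (hypothetical variable names).
single_real = to_seconds("2m46.186s")  # train_mnist.py, GPU 0
dual_real = to_seconds("1m21.280s")    # train_mnist_data_parallel.py, GPU 0 and 1
single_elapsed = 163.166               # elapsed_time at epoch 50, 1 GPU
dual_elapsed = 78.3696                 # elapsed_time at epoch 50, 2 GPUs

real_speedup = single_real / dual_real
elapsed_speedup = single_elapsed / dual_elapsed

print(f"wall-clock (real) speedup: {real_speedup:.2f}x")
print(f"trainer elapsed_time speedup: {elapsed_speedup:.2f}x")
```

Both ratios come out at roughly 2x. Note that the two runs are not a like-for-like comparison: the single-GPU run used a minibatch size of 100 and the two-GPU run used 400, so part of the gain probably comes from running fewer, larger iterations per epoch rather than from the second GPU alone.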
◯ Compute server
CPU : Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz x 2
Memory : 512GB
GPU : NVIDIA Tesla P100 x 2
OS : CentOS 7.5