Using the GPU dgemm routine from the following paper, presented at SC11 last year, I measured dgemm performance on each GPU.
Fast Implementation of DGEMM on Fermi GPU
◯ Tesla C2075 : reaches about 360 GFlops. cublas gets about 315 GFlops, dropping to about 293 GFlops on the largest problems
# Running on 'Tesla C2075'
# SMs = 14
# clock = 1147000
# memory = 5636554752 (5553774592 free)
atrans btrans M N K gflops cublas
N T 4096 4096 768 355.677 311.394
N T 4096 4096 1024 357.225 313.001
N T 4096 4096 2048 359.896 315.351
N T 4096 4096 4096 361.272 314.896
N T 4096 4096 8192 361.961 293.275
N T 8192 8192 768 356.311 311.665
N T 8192 8192 1024 357.551 313.253
N T 8192 8192 2048 360.210 315.614
N T 8192 8192 4096 361.555 293.124
N T 8192 8192 8192 362.258 293.341
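To put the gap between the two columns into a single number, the ratio of the new kernel to cublas can be computed per row. The following is a minimal sketch in Python; the rows are copied from the Tesla C2075 output above, and `speedup` is a hypothetical helper name introduced only for this illustration.

```python
# Rows copied from the Tesla C2075 output: (M, N, K, gflops, cublas).
C2075_ROWS = [
    (4096, 4096, 768, 355.677, 311.394),
    (4096, 4096, 1024, 357.225, 313.001),
    (4096, 4096, 2048, 359.896, 315.351),
    (4096, 4096, 4096, 361.272, 314.896),
    (4096, 4096, 8192, 361.961, 293.275),
    (8192, 8192, 768, 356.311, 311.665),
    (8192, 8192, 1024, 357.551, 313.253),
    (8192, 8192, 2048, 360.210, 315.614),
    (8192, 8192, 4096, 361.555, 293.124),
    (8192, 8192, 8192, 362.258, 293.341),
]


def speedup(kernel_gflops, cublas_gflops):
    """Throughput of the measured kernel relative to cublas (hypothetical helper)."""
    return kernel_gflops / cublas_gflops


# One ratio per benchmark row, in the same order as the table.
ratios = [speedup(g, c) for (_, _, _, g, c) in C2075_ROWS]

if __name__ == "__main__":
    for (m, n, k, g, c), r in zip(C2075_ROWS, ratios):
        print(f"M={m} N={n} K={k}: {r:.3f}x over cublas")
```

On these rows the ratio stays a little above 1.14 for most sizes and rises to about 1.23 where the cublas figure falls to roughly 293 GFlops.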
◯ GeForce GTX 580 : just under 200 GFlops, with little difference from cublas
# Running on 'GeForce GTX 580'
# SMs = 16
# clock = 1544000
# memory = 1609760768 (1489625088 free)
atrans btrans M N K gflops cublas
N T 4096 4096 768 196.865 196.606
N T 4096 4096 1024 197.025 196.716
N T 4096 4096 2048 197.345 196.885
N T 4096 4096 4096 197.485 196.964
N T 4096 4096 8192 197.561 195.068
N T 8192 8192 768 196.870 196.673
N T 8192 8192 1024 197.046 196.763
N T 8192 8192 2048 197.342 196.945
N T 8192 8192 4096 197.483 195.130
◯ GeForce GTX 480 : about 167 GFlops, again with little difference from cublas
# Running on 'GeForce GTX 480'
# SMs = 15
# clock = 1401000
# memory = 1609760768 (1529470976 free)
atrans btrans M N K gflops cublas
N T 4096 4096 768 167.022 167.253
N T 4096 4096 1024 167.169 167.346
N T 4096 4096 2048 167.415 167.490
N T 4096 4096 4096 167.532 167.569
N T 4096 4096 8192 167.595 165.462
N T 8192 8192 768 167.480 167.393
N T 8192 8192 1024 167.624 167.478
N T 8192 8192 2048 167.875 167.609
N T 8192 8192 4096 167.993 165.899
◯ GeForce GTX 460 : about 76 GFlops. At this point the CPU is faster
# Running on 'GeForce GTX 460'
# SMs = 7
# clock = 1400000
# memory = 1072889856 (986570752 free)
atrans btrans M N K gflops cublas
N T 4096 4096 768 76.060 75.477
N T 4096 4096 1024 76.166 75.567
N T 4096 4096 2048 76.325 75.693
N T 4096 4096 4096 76.404 75.751
N T 4096 4096 8192 76.441 75.611
N T 8192 8192 768 76.152 75.496
N T 8192 8192 1024 76.264 75.582
N T 8192 8192 2048 76.420 75.705
◯ Tesla C1060 : the timing values are abnormal. Judging from the cublas values (about 74 GFlops), its performance is below that of the GeForce GTX 460
# Running on 'Tesla C1060'
# SMs = 30
# clock = 1296000
# memory = 4294770688 (4237299456 free)
atrans btrans M N K gflops cublas
N T 512 512 768 242136.625 65.597
N T 512 512 1024 342559.344 65.784
N T 512 512 2048 633257.062 65.998
N T 512 512 4096 1316020.750 65.999
N T 512 512 8192 2796373.250 66.181
N T 1024 1024 768 1027845.250 72.169
N T 1024 1024 1024 1291185.250 72.253
N T 1024 1024 2048 2739806.250 72.417
N T 1024 1024 4096 5593088.000 72.513
N T 1024 1024 8192 11185493.000 72.547
N T 2048 2048 768 4111381.000 73.763
N T 2048 2048 1024 5595136.000 73.803
N T 2048 2048 2048 10959225.000 73.916
N T 2048 2048 4096 22848358.000 73.961
N T 2048 2048 8192 44741972.000 73.990
N T 3072 3072 768 9250607.000 74.131
N T 3072 3072 1024 12332137.000 74.188
N T 3072 3072 2048 24658256.000 74.299
N T 3072 3072 4096 47376748.000 74.348
N T 3072 3072 8192 98614968.000 74.373
N T 4096 4096 768 15800601.000 74.226
N T 4096 4096 1024 21923798.000 74.285
N T 4096 4096 2048 42117804.000 74.374
N T 4096 4096 4096 89489408.000 74.422
N T 4096 4096 8192 162084128.000 74.443
N T 8192 8192 768 67152552.000 74.347
N T 8192 8192 1024 85941288.000 74.400
N T 8192 8192 2048 179000656.000 74.486
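One way to see how abnormal the Tesla C1060 timings are is to invert the usual dgemm operation count and recover the elapsed time implied by each reported figure. The sketch below assumes the conventional count of 2*M*N*K floating-point operations per dgemm call; the benchmark's own formula is not shown in the output, so this is an assumption, and `implied_seconds` is a hypothetical helper name.

```python
# A few rows copied from the Tesla C1060 output: (M, N, K, gflops, cublas).
C1060_ROWS = [
    (512, 512, 768, 242136.625, 65.597),
    (1024, 1024, 768, 1027845.250, 72.169),
    (4096, 4096, 4096, 89489408.000, 74.422),
    (8192, 8192, 2048, 179000656.000, 74.486),
]


def implied_seconds(m, n, k, gflops):
    """Elapsed time implied by a GFlops figure, assuming 2*M*N*K operations."""
    flops = 2.0 * m * n * k
    return flops / (gflops * 1e9)


# Implied elapsed times for the measured kernel and for cublas, row by row.
kernel_times = [implied_seconds(m, n, k, g) for (m, n, k, g, _) in C1060_ROWS]
cublas_times = [implied_seconds(m, n, k, c) for (m, n, k, _, c) in C1060_ROWS]

if __name__ == "__main__":
    for (m, n, k, _, _), tk, tc in zip(C1060_ROWS, kernel_times, cublas_times):
        print(f"M={m} N={n} K={k}: kernel {tk:.3e} s, cublas {tc:.3e} s")
```

Under that assumption, the implied time for the measured kernel stays at a couple of microseconds regardless of problem size, while the cublas time grows with M*N*K as expected. A roughly constant, very short time suggests that the kernel may not have actually run. The routine targets Fermi and the Tesla C1060 is a pre-Fermi GPU, so a failed launch is one possible explanation, but the output alone does not confirm it.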