CUDA 9.1 Patch 1 (Released Jan 25, 2018)
Patch 1 (Released Jan 25, 2018) Download (112.9 MB)
cuBLAS Patch Update: This update to CUDA 9.1 includes new GEMM kernels optimized for the Volta architecture and improved heuristics to select GEMM kernels for given input sizes.
Continuing from the previous post: matrixMulCUBLAS, for its part, got considerably faster.
Before applying the patch
# ./matrixMulCUBLAS
[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "Tesla V100-PCIE-16GB" with compute capability 7.0
GPU Device 0: "Tesla V100-PCIE-16GB" with compute capability 7.0
MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
Computing result using CUBLAS...done.
Performance= 3544.62 GFlop/s, Time= 0.055 msec, Size= 196608000 Ops
Computing result using host CPU...done.
Comparing CUBLAS Matrix Multiply with CPU results: PASS
After applying the patch
# ./matrixMulCUBLAS
[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "Tesla V100-PCIE-16GB" with compute capability 7.0
GPU Device 0: "Tesla V100-PCIE-16GB" with compute capability 7.0
MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
Computing result using CUBLAS...done.
Performance= 7441.86 GFlop/s, Time= 0.026 msec, Size= 196608000 Ops
Computing result using host CPU...done.
Comparing CUBLAS Matrix Multiply with CPU results: PASS
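To put a number on the difference, the figures in the two logs above can be checked by hand. The following is a minimal sketch, assuming the sample counts 2*M*K*N floating-point operations for the multiply (this matches the "Size= 196608000 Ops" line). The helper names gemm_ops and implied_msec are made up for this illustration and are not part of the CUDA sample.

```python
# Dimensions taken from the log line: MatrixA(640,480), MatrixB(480,320)
M, K, N = 640, 480, 320

def gemm_ops(m, k, n):
    """Operation count for an (m x k) by (k x n) multiply: one multiply and one add per term."""
    return 2 * m * k * n

def implied_msec(ops, gflops):
    """Kernel time in milliseconds implied by an operation count and a GFlop/s figure."""
    return ops / (gflops * 1e9) * 1e3

ops = gemm_ops(M, K, N)

# Performance values copied from the two runs above
before_gflops = 3544.62
after_gflops = 7441.86

speedup = after_gflops / before_gflops

print(ops)                                      # 196608000, same as "Size=" in the log
print(f"{implied_msec(ops, before_gflops):.3f}")  # 0.055
print(f"{implied_msec(ops, after_gflops):.3f}")   # 0.026
print(f"{speedup:.2f}x")                          # 2.10x
```

The implied times agree with the "Time=" values the sample printed, and the patched cuBLAS comes out at roughly 2.1x the throughput for this matrix size.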