研究日誌。

大規模なグラフ処理に対してメモリ階層構造を考慮した高性能なソフトウェアを開発。

NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

2013-08-12 09:12:26 | Weblog
こちらでも紹介いただいているが、IEEE BigData 2013 に提出していた論文 (regular paper) が採択されました。採択率が 17.37% と低いので採択されてほっとしました。

内容は CPU-based の single-node 上での並列幅優先探索に対する NUMA を考慮した高速化です。
・基本的なアルゴリズムは Beamer による Hybrid algorithm
・隣接枝リストに対する列方向分割によって、各スレッドのリモート RAM へのメモリアクセスを大幅に削減
・CPU Affinity と local allocation に関するライブラリを開発し、それをベースに実装
・SCALE 26 の Kronecker graph (点数 2^26, 枝数 2^30) で 11.15 GigaTEPS (TEPS は BFS における 1 秒間辺りの探索枝数) を達成
・ほぼ同じ計算機環境で Beamer の実装 (5.1 GTEPS) に比べ 2 倍強の高速化を達成


"NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System"

has been accepted as a regular paper presentation. This year, we have received 259 submissions, only 45 are accepted as regular papers (acceptance rate 17.37%). Papers went through a rigorous review process. Each paper was reviewed by at least three program committee members (most of them are five and some even more). All the paper decisions are made from the weighted scoring of the reviewing reports.

Title:
NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Author:
Yuichiro Yasui, Katsuki Fujisawa, and Kazushige Goto

Abstract―The breadth-first search (BFS) is one of the most important kernels in graph theory. The Graph500 benchmark measures the performance in traversed edges per second
(TEPS) for each supercomputer performing a BFS. Previous studies have proposed a hybrid approach that combines a wellknown top-down algorithm and an efficient bottom-up algorithm for large frontiers. This reduces some unnecessary searching of outgoing edges for a large frontier in BFS traversal of a smallworld graph such as a Kronecker graph. In this paper, we describe a highly efficient BFS using column-wise partitioning of the adjacency list while carefully considering the NUMA architecture system. We explicitly manage how each working thread accesses a partial adjacency list in local memory during BFS traversal. Our implementation has achieved a processing rate of 11.15 billion edges per second on a 4-way Intel Xeon E5-4640 system for a scale-26 problem of a Kronecker graph with 2^26 vertices and 2^30 edges.