Defining and Measuring Supercomputer RAS (Cray XT3)

2007-03-13 | SuperComputer

[07/03/15]
Information Processing Society of Japan (IPSJ), Special Interest Group on System Evaluation (SIGEVA)
　http://www.ipsj.or.jp/sig/eva/indexja.html
==========
[07/03/14]
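A blog that explains semiconductor fab logistics in great detail:
ファブ内物流の論理を求めて ("In Search of the Logic of In-Fab Logistics")
　http://d.hatena.ne.jp/CUSCUS/
"Introduction to Factory Physics", 2007-02-26
　http://d.hatena.ne.jp/CUSCUS/20070226
"Factory Physics Overview", 2007-01-28
　http://d.hatena.ne.jp/CUSCUS/20070128
"The Purpose of Studying Factory Physics", 2007-03-09
　http://d.hatena.ne.jp/CUSCUS/20070309
　"I believe the purpose of studying Factory Physics is to understand the various trade-offs involved in running a factory."
"Running a factory" could perhaps be replaced with "providing computing services."
I intend to study this material.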
==========
[07/03/13]
Defining and Measuring Supercomputer Reliability, Availability, and Serviceability (RAS),
　Jon Stearley, Sandia National Laboratories.
　http://www.cs.sandia.gov/~jrstear/ras/

Abstract
"The absence of agreed definitions and metrics for supercomputer RAS obscures meaningful
discussion of the issues involved and hinders their solution. This paper seeks to foster a common
basis for communication about supercomputer RAS, by proposing a system state model, definitions,
and measurements. These are modeled after the SEMI-E10 specification which is widely used
in the semiconductor manufacturing industry."
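
The abstract refers to a system state model and to measurements built on it. As a rough illustration only, the generic idea of deriving an availability figure from time spent in each state can be sketched as below. The state names, the split into "up" states, and the sample hours are all hypothetical; they are not the definitions used in the paper or in SEMI E10.

```python
from collections import defaultdict

# Hypothetical state names, chosen only for this illustration.
# They are NOT taken from the paper or from SEMI E10.
UP_STATES = {"production", "standby"}


def hours_by_state(intervals):
    """Sum the hours logged for each state.

    `intervals` is an iterable of (state_name, hours) pairs.
    """
    totals = defaultdict(float)
    for state, hours in intervals:
        totals[state] += hours
    return dict(totals)


def availability(intervals):
    """Fraction of all logged time spent in an 'up' state."""
    totals = hours_by_state(intervals)
    total_hours = sum(totals.values())
    if total_hours == 0:
        raise ValueError("no time recorded")
    up_hours = sum(h for state, h in totals.items() if state in UP_STATES)
    return up_hours / total_hours


# One made-up week (168 hours) of state records.
log = [
    ("production", 150.0),
    ("scheduled_downtime", 8.0),
    ("production", 6.0),
    ("unscheduled_downtime", 4.0),
]
print(f"availability = {availability(log):.4f}")
```

Which states count as "up" changes the resulting number, which is one reason agreed definitions matter when comparing systems.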

Application to Red Storm:
"The details necessary to apply these concepts to the Red Storm (Cray XT3) supercomputer
were presented at the 2005 Cray Users Group meeting."

SEMI E10-0304
"Specification for Definition and Measurement of Equipment Reliability, Availability, and Maintainability (RAM)",
　This standard was editorially modified in February 2004
　http://downloads.semi.org/pubs/semipubs.nsf/db9ffcaf9db4331488256532005caded/260567640334e08888256516007bdbf2
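"Specification for Definition and Measurement of Semiconductor Manufacturing Equipment Reliability, Availability, and Maintainability" (Japanese edition),
　Published: November 2004
　"This document provides a standard for measuring the RAM performance of semiconductor manufacturing equipment
　in a manufacturing environment, establishing a common basis for communication between equipment users and suppliers."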
　http://shop.bookpark.ne.jp/cm/pudding.asp?review=off&content_id=SEMI-JE010-00-0304E-PDF

Related project by Stearley:
Sisyphus: an event log data-mining toolkit
　http://www.cs.sandia.gov/~jrstear/sisyphus/

Related reports and projects:
HDD Reliability Notes 2: FAST'07 (Google & CMU) by StorageMojo, 2007-03-11
"Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?",
　Bianca Schroeder and Garth A. Gibson, Carnegie Mellon University.
　Presented at FAST'07
Analyzing Failure Data, Parallel Data Lab, School of Computer Science, Carnegie Mellon University.
Empirical System Reliability, Parallel Data Lab, School of Computer Science, Carnegie Mellon University
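
The question in the Schroeder and Gibson title can be made concrete with a small calculation. The sketch below converts a nominal MTTF into an approximate annualized failure rate, assuming a constant failure rate and continuous (24x365) operation. The function names and the 10,000-drive population are illustrative assumptions, and the figures are plain datasheet arithmetic, not results from the paper.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def annualized_failure_rate(mttf_hours: float) -> float:
    """Approximate fraction of units failing per year.

    Assumes a constant failure rate and an MTTF much longer than one
    year, so that AFR is roughly hours-per-year divided by MTTF.
    """
    if mttf_hours <= 0:
        raise ValueError("MTTF must be positive")
    return HOURS_PER_YEAR / mttf_hours


def expected_failures_per_year(mttf_hours: float, population: int) -> float:
    """Expected number of failures per year in a population of units."""
    return population * annualized_failure_rate(mttf_hours)


afr = annualized_failure_rate(1_000_000)
print(f"Nominal AFR: {afr:.3%}")

# Hypothetical population of 10,000 drives.
print(f"Expected failures per year: {expected_failures_per_year(1_000_000, 10_000):.1f}")
```

Under these assumptions, a 1,000,000-hour MTTF corresponds to somewhat under 1% of units failing per year, which gives a baseline against which field data can be compared.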
