Title: Fault-tolerance techniques for High Performance Computing
Speaker: Prof. Yves Robert
Host: Wang Changbo (王长波)
Time: Wednesday, May 3, 13:30-14:30
Venue: Room 201, Mathematics Building, Zhongbei Campus
About the speaker:
Yves Robert received his PhD degree from Institut National Polytechnique de Grenoble. He is currently a full professor in the Computer Science Laboratory LIP at ENS Lyon. He is the author of 7 books, 150 papers published in international journals, and 240 papers published in international conferences. He is the editor of 11 book proceedings and 13 journal special issues, and he has advised 30 PhD theses. His main research interests are scheduling techniques and resilient algorithms for large-scale platforms. He is a Fellow of the IEEE and has been elected a Senior Member of Institut Universitaire de France. He received the 2014 IEEE TCSC Award for Excellence in Scalable Computing and the 2016 IEEE TCPP Outstanding Service Award. He has held a Visiting Scientist position at the University of Tennessee, Knoxville since 2011.
Abstract:
This talk will provide an overview of fault-tolerance techniques for High Performance Computing at very large scale. We first address fail-stop errors, a.k.a. unrecoverable failures, and discuss various checkpoint protocols. We then discuss silent errors, a.k.a. silent data corruptions, and present several detection/correction mechanisms.