Talk title: A Better Machine Learning Serving Platform And More
Speaker: Yao Lu, Researcher, Microsoft Research Redmond
Host: Jian Pu (浦剑)
Time: Friday, November 23, 14:00-16:00
Venue: Room B222, Science Building, Zhongbei Campus
Speaker bio:
Yao Lu is a researcher in the Data Management, Exploration and Mining (DMX) group at the Microsoft Research Redmond Lab. His interests are artificial intelligence, machine learning, and vision. Recently, he has been working at the intersection of ML and systems, building big-data platforms for large-scale machine learning and using machine learning to optimize existing data systems. Yao received his PhD from the Paul G. Allen School of Computer Science and Engineering, University of Washington. His work received the ICME 2017 best paper award and the SIGMOD 2018 best demonstration award.
Abstract:
Machine learning has become an important customer of recent big-data platforms such as Azure Cosmos DB. In such systems, users have started to implement ML algorithms as user-defined functions (UDFs) on top of relational platforms, so that ML pipelines can benefit from relational query optimizations such as auto-parallelization and query de-duplication. However, many other query optimization techniques, including predicate pushdown, are of limited use for ML inference. The UDFs that extract relational columns from unstructured inputs are often expensive, and a query predicate stays stuck behind these UDFs whenever it requires columns that the UDFs generate. We propose constructing and applying probabilistic predicates to filter out data blobs that do not satisfy the query predicate before the expensive UDFs run. To support complex predicates and to avoid per-query training, we augment a cost-based query optimizer to choose plans with appropriate combinations of simpler probabilistic predicates. Experiments with several ML workloads show that query processing improves further, by as much as 10x.
On the other hand, machine learning has opened up great opportunities for big-data systems. This talk will also cover a few open and actively studied topics on improving different aspects of big-data systems with machine learning.
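
To make the probabilistic-predicate idea in the abstract concrete, here is a minimal Python sketch. It is not the speaker's implementation. The `Blob` type, the `brightness` feature, the `has_red_car` column, and the threshold value are all hypothetical, chosen only to show a cheap filter running ahead of an expensive UDF so that fewer blobs reach it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Blob:
    blob_id: int
    brightness: float   # hypothetical cheap feature, readable without the UDF
    has_red_car: bool   # hypothetical column that only the expensive UDF reveals


class CountingUDF:
    """Stand-in for an expensive ML UDF; counts how often it is invoked."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, blob: Blob) -> dict:
        self.calls += 1
        return {"id": blob.blob_id, "has_red_car": blob.has_red_car}


def run_query(blobs: List[Blob], udf: CountingUDF,
              threshold: Optional[float] = None) -> List[int]:
    """Return ids of blobs where has_red_car is true.

    If threshold is given, a cheap filter (assumed here to be a simple
    cutoff on brightness) drops blobs before the UDF is applied.
    """
    matches = []
    for blob in blobs:
        if threshold is not None and blob.brightness < threshold:
            continue  # filtered out cheaply; the UDF never sees this blob
        row = udf(blob)
        if row["has_red_car"]:  # the original query predicate
            matches.append(row["id"])
    return matches


# Toy data: in this made-up set, only bright blobs contain a red car.
BLOBS = [Blob(i, i / 10, i >= 7) for i in range(10)]

if __name__ == "__main__":
    plain, filtered = CountingUDF(), CountingUDF()
    print(run_query(BLOBS, plain), plain.calls)
    print(run_query(BLOBS, filtered, threshold=0.45), filtered.calls)
```

In this toy data the filter never drops a matching blob. A real probabilistic predicate is a trained classifier that may discard some true matches, so its threshold trades accuracy for speed.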