Title: Data Glitches = Constraint Violations – Empirical Explanations
Speaker: Divesh Srivastava, ACM Fellow, Head of Database Research, AT&T Labs-Research
Host: Professor Xuemin Lin
Time: December 16, 2017, 09:50-10:40
Venue: Room A510, Science Building, Zhongbei Campus, East China Normal University
Abstract:
Data glitches are unusual observations that do not conform to data quality expectations, be they semantic or syntactic, logical or statistical. By naively applying integrity constraints, potentially large amounts of data could be flagged as being violations. Ignoring or repairing significant amounts of the data could fundamentally bias the results and conclusions drawn from analyses. In the context of Big Data, where large volumes and varieties of data from disparate sources are integrated, it is likely that significant portions of these violations are actually legitimate, usable data. We conjecture that empirical glitch explanations – concise characterizations of subsets of violating data – could be used to (a) identify legitimate data and release them back into the pool of clean data, thereby reducing cleaning-related statistical distortion of the data; and (b) refine existing integrity constraints and generate improved domain knowledge. We present a few real-world case studies in support of our conjecture, outline scalable techniques to address the challenges of discovering explanations, and demonstrate the utility of the explanations in reclaiming over 99% of the violating data.
Speaker Bio:
Divesh Srivastava is the head of Database Research at AT&T Labs-Research. He is a Fellow of the Association for Computing Machinery (ACM) and the managing editor of the Proceedings of the VLDB Endowment (PVLDB). His research interests and publications span a variety of topics in data management. He received his Ph.D. from the University of Wisconsin, Madison, USA, and his Bachelor of Technology from the Indian Institute of Technology, Bombay, India.