This book addresses several knowledge discovery problems on multi-sourced data, drawing together theories, techniques, and methods from data cleaning, data mining, and natural language processing. It focuses on three data models: multi-sourced isomorphic data, multi-sourced heterogeneous data, and text data. Building on these three models, the book studies knowledge discovery problems, including truth discovery and fact discovery on multi-sourced data, with respect to four important properties: relevance, inconsistency, sparseness, and heterogeneity. It is useful for specialists as well as graduate students.
Data, even when describing the same object or event, can come from a variety of sources such as crowd workers and social media users. However, noisy pieces of data or information are unavoidable. Given the daunting scale of the data, it is unrealistic to expect humans to “label” the data or to tell which data source is more reliable. Hence, it is crucial to identify trustworthy information from multiple noisy information sources, which is the task of knowledge discovery.
At present, knowledge discovery research for multi-sourced data faces two main challenges. On the structural level, it is essential to consider the different characteristics of data composition and application scenarios, and to define the knowledge discovery problem for each setting. On the algorithm level, the knowledge discovery task needs to account for different levels of information conflict and to design efficient algorithms that mine more valuable information from multiple clues. Existing knowledge discovery methods have shortcomings on both the structural level and the algorithm level, so the knowledge discovery problem is far from fully solved.
Contents
1. Introduction.- 2. Functional-dependency-based truth discovery for isomorphic data.- 3. Denial-constraint-based truth discovery for isomorphic data.- 4. Pattern discovery for heterogeneous data.- 5. Deep fact discovery for text data.
About the authors
Chen Ye is currently an Associate Researcher at the School of Computer Science and Technology, Hangzhou Dianzi University, China. She received her Ph.D. degree in Computer Software and Theory from Harbin Institute of Technology, China. Her current research interests include data repairing, truth discovery, and crowdsourcing. She won the ACM SIGMOD China Doctoral Dissertation Award in 2020.
Hongzhi Wang is a Professor and Doctoral Supervisor at the School of Computer Science and Technology, Harbin Institute of Technology, China. His research interests include big data management and analysis, data quality, graph data management, and web data management. He has published more than 150 papers. He is the Principal Investigator of more than 10 projects, including three NSFC projects, and co-PI of 973, 863, and NSFC key projects. He has received a Microsoft Fellowship, the title of China Excellent Database Engineer, and an IBM Ph.D. Fellowship.
Guojun Dai works at the School of Computer Science and Technology, Hangzhou Dianzi University, where he is the Head of the National Brain-Computer Collaborative Intelligent Technology International Joint Research Center and the Director of the Institute of Computer Application Technology. His research interests include the Internet of Things, industrial big data, network collaborative manufacturing, edge computing, brain-computer interfaces, cognitive computing, and artificial intelligence. He has published over 50 research papers in top-quality international conferences and journals, notably INFOCOM, IEEE Transactions on Industrial Informatics, and IEEE Transactions on Mobile Computing.