This book provides a comprehensive introduction to Big Data Infrastructure technologies, existing cloud-based platforms, and tools for Big Data processing and data analytics. It combines a conceptual approach to architecture design with a practical approach to technology selection and project implementation.
Readers will learn the core functionality of the major Big Data Infrastructure components and how they integrate to form a coherent solution with business benefits. Particular attention is given to understanding and using the Apache Hadoop ecosystem, a major Big Data platform, and its main functional components: MapReduce, HBase, Hive, Pig, Spark, and streaming analytics. The book also covers enterprise and research data management and governance, and explains modern approaches to cloud and Big Data security and compliance.
The book covers two knowledge areas defined in the EDISON Data Science Framework (EDSF): Data Science Engineering, and Data Management and Governance. It can be used as a textbook for university courses, or as a basis for practitioners who want to continue with self-study, apply Big Data technologies in practice, and competently evaluate and implement practical projects in their organizations.
Table of Contents
Chapter 1 Introduction. – Chapter 2 Big Data Technologies Foundation: Definition, Reference Architecture, Use Cases. – Chapter 3 Cloud Computing Foundation: Definition, Reference Architecture, Foundational Technologies, Use Cases. – Chapter 4 Cloud and Big Data Service Providers and Platforms. – Chapter 5 Big Data Algorithms, MapReduce and Hadoop Ecosystem. – Chapter 6 Streaming Analytics and Spark. – Chapter 7 Data Structures for Big Data, Modern Big Data SQL and NoSQL Databases. – Chapter 8 Enterprise Data Governance and Management. – Chapter 9 Research Data Management. – Chapter 10 Big Data Security and Compliance, Data Privacy Protection. – Chapter 11 Finding Data on the Web, Data Sets, Web Scraping, Web API. – Chapter 12 Data Science Projects Management, DataOps, MLOps. – Chapter 13 Data Science Projects Development with Amazon SageMaker. – Chapter 14 Data Validation for Data Science Projects.
About the Authors
Dr. Yuri Demchenko is a Senior Researcher and lecturer at the Complex Cyber Infrastructure Research Group of the University of Amsterdam. He graduated from the National Technical University of Ukraine ‘Kyiv Polytechnic Institute’, where he also received his PhD degree. His main research areas include Data Science and Data Management, Big Data Infrastructure and Technologies for Data Analytics, DevSecOps, and general security architectures. He has been involved in many European projects, such as EGEE, GEANT4, FAIRsFAIR, and SLICES-DS. His current work focuses on building the European SLICES Research Infrastructure for experimentation on emerging digital technologies in the SLICES-PP project, and on developing foundations for improving the energy efficiency and reducing the environmental impact of future digital research infrastructures in the GreenDIGIT project. He actively researches the architectural and design aspects of research data management infrastructure for the reproducibility and automation of experimental research.
Juan J. Cuadrado-Gallego, PhD, is an Associate Professor in the Department of Computer Science at the University of Alcalá, Madrid, Spain, in the area of Computer Science and Artificial Intelligence. He has been a Visiting Associate Professor in the Department of Computer Science and Software Engineering of Concordia University in Montreal, Canada, and in the Department of Software and IT Engineering of the École de Technologie Supérieure in Montreal, Canada. He has also been a Visiting Professor at the National Polytechnic Institute in Mexico City, Mexico. He holds an MRes, MSc, and BSc in Physics from the Complutense University of Madrid, Spain, and a PhD in Computer Science from the Carlos III University of Madrid. In 2010, he obtained the Outstanding Research Pathway certification from the National Agency for Evaluation and Prospective of the Ministry of Science and Innovation, within the I3 Program. Dr. Cuadrado-Gallego has carried out research stays at the University of Amsterdam, The Netherlands; the Otto-von-Guericke-University, Magdeburg, Germany; the University of Reading, UK; and the Università Roma Tre in Rome, Italy.
Prof. Dr. Oleg Chertov is the Head of the Applied Mathematics Department at the National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute” and the author of the textbook “Calculus for Programmers” (2017). He received his Master’s degree in Applied Mathematics (1987) and a PhD degree in Engineering Sciences (1991) from the same university. He holds a Habil. Dr. degree (Doctor in Engineering Sciences, 2014) from the Institute of Mathematical Machines and Systems Problems of the National Academy of Sciences of Ukraine. He has been a university project coordinator in several Horizon 2020 and NATO Science for Peace & Security projects, and a consultant on Big Data projects for the World Bank and the United Nations Population Fund. His interests include Official Statistics, Data Mining & Machine Learning, and Information Security (Group Anonymity).
Dr. Marharyta Aleksandrova is an Applied Scientist at Amazon Luxembourg. She received her master’s degree from the National Technical University of Ukraine ‘Igor Sikorsky Kyiv Polytechnic Institute’, and a double PhD from the same university and the University of Lorraine, France. After completing her PhD, she was a postdoc at the University of Luxembourg, where she worked on multiple research projects and started a new research direction in her host group. At Amazon, she works on various projects that contribute to smooth transportation execution. Her research interests and experience include recommender systems, the application of ML to security, causal ML, prediction with accuracy guarantees, and optimization. In her current role, she has also gained exposure to industrial-scale problems and coding standards.