Build efficient data flow and machine learning programs with this flexible, multi-functional open-source cluster-computing framework
Key Features
- Master the art of real-time big data processing and machine learning
- Explore a wide range of use-cases for analyzing large datasets
- Discover ways to optimize your work by using many features of Spark 2.x and Scala
Book Description
Apache Spark is an in-memory, cluster-based data processing system that provides a wide range of functionality, including big data processing, analytics, and machine learning. With this Learning Path, you can take your knowledge of Apache Spark to the next level by learning how to expand Spark’s functionality and build your own data flow and machine learning programs on this platform.
You will work with the different modules in Apache Spark, such as interactive querying with Spark SQL, using DataFrames and Datasets, implementing streaming analytics with Spark Streaming, and applying machine learning and deep learning techniques on Spark using MLlib and various external tools.
By the end of this elaborately designed Learning Path, you will have all the knowledge you need to master Apache Spark, and build your own big data processing and analytics pipeline quickly and without any hassle.
This Learning Path includes content from the following Packt products:
- Mastering Apache Spark 2.x by Romeo Kienzler
- Scala and Spark for Big Data Analytics by Md. Rezaul Karim, Sridhar Alla
- Apache Spark 2.x Machine Learning Cookbook by Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, Shuen Mei
What you will learn
- Get to grips with all the features of Apache Spark 2.x
- Perform highly optimized real-time big data processing
- Use ML and DL techniques with Spark MLlib and third-party tools
- Analyze structured and unstructured data using Spark SQL and GraphX
- Understand tuning, debugging, and monitoring of big data applications
- Build scalable and fault-tolerant streaming applications
- Develop scalable recommendation engines
Who this book is for
If you are an intermediate-level Spark developer looking to master the advanced capabilities and use-cases of Apache Spark 2.x, this Learning Path is ideal for you. Big data professionals who want to learn how to integrate and use the features of Apache Spark and build a strong big data pipeline will also find this Learning Path useful. To grasp the concepts explained in this Learning Path, you must know the fundamentals of Apache Spark and Scala.
Romeo Kienzler works as the chief data scientist in the IBM Watson IoT worldwide team, helping clients to apply advanced machine learning at scale on their IoT sensor data. He holds a Master’s degree in computer science from the Swiss Federal Institute of Technology, Zurich, with a specialization in information systems, bioinformatics, and applied statistics.
Md. Rezaul Karim is a Research Scientist at Fraunhofer FIT, Germany. He is also a PhD candidate at RWTH Aachen University, Aachen, Germany. He has more than 8 years’ experience in the area of research and development with a solid understanding of algorithms and data structures in C, C++, Java, Scala, R, and Python.
Sridhar Alla is a big data expert helping companies solve complex problems in distributed computing, large-scale data science, and analytics practice. He holds a bachelor’s in computer science from JNTU, India. He loves writing code in Python, Scala, and Java. He also has extensive hands-on knowledge of several Hadoop-based technologies, TensorFlow, NoSQL, IoT, and deep learning.
Siamak Amirghodsi (Sammy) is interested in building advanced technical teams, executive management, Spark, Hadoop, big data analytics, AI, deep learning nets, TensorFlow, cognitive models, swarm algorithms, real-time streaming systems, quantum computing, financial risk management, trading signal discovery, econometrics, long-term financial cycles, IoT, blockchain, probabilistic graphical models, cryptography, and NLP.
Meenakshi Rajendran is experienced in the end-to-end delivery of data analytics and data science products for leading financial institutions. Meenakshi holds a master’s degree in business administration and is a certified PMP with over 13 years of experience in global software delivery environments. Her areas of research and interest are Apache Spark, cloud, regulatory data governance, machine learning, Cassandra, and managing global data teams at scale.
Broderick Hall is a hands-on big data analytics expert and holds a master’s degree in computer science with 20 years of experience in designing and developing complex enterprise-wide software applications with real-time and regulatory requirements at a global scale. He is a deep learning early adopter and is currently working on a large-scale cloud-based data platform with deep learning net augmentation.
Shuen Mei is a big data analytics platforms expert with 15+ years of experience in designing, building, and executing large-scale, enterprise-distributed financial systems with mission-critical low-latency requirements. He is certified in Apache Spark and the Cloudera Big Data platform, including Developer, Admin, and HBase. He is also a certified AWS solutions architect with emphasis on petabyte-range real-time data platform systems.