Hands-On Guide to Apache Hadoop and Apache Spark: A Beginner’s Guide by Alfonso Antolinez Garcia
English | November 30, 2023 | ISBN: N/A | ASIN: B0CP8VMR27 | 77 pages | EPUB | 2.16 Mb
Apache Hadoop and Apache Spark are the two main frameworks in the world of big data processing and analytics.
Apache Hadoop allows for the distributed processing of large data sets across clusters of commodity hardware. It is designed to scale from single servers up to thousands of machines, each offering local computation and storage, and to deliver high availability by handling failures at the application layer. Hadoop has four major components: Hadoop Common, HDFS, YARN, and MapReduce.
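To make the MapReduce model concrete, here is a tiny, self-contained Python simulation of the map, shuffle, and reduce phases of a word count (purely illustrative; a real job would run through the Hadoop Streaming or Java APIs):

```python
from collections import defaultdict

# Map phase: each input line yields intermediate (word, 1) pairs.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word, 1

# Shuffle phase: group the intermediate pairs by key.
def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: aggregate the grouped values for each key.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big ideas", "data at scale"]
print(reduce_phase(shuffle_phase(map_phase(lines))))
# {'big': 2, 'data': 2, 'ideas': 1, 'at': 1, 'scale': 1}
```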
In contrast, Apache Spark was crafted for rapid, in-memory data processing. Spark ships with many built-in libraries for big-data challenges at scale, including machine learning and both batch and streaming data processing. Apache Spark’s fundamental data abstraction is the resilient distributed dataset (RDD): a fault-tolerant, immutable, distributed collection of objects, tailored for parallel processing across thousands of nodes in a cluster.
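As a minimal sketch of the RDD abstraction (using PySpark; the local master URL and app name here are illustrative placeholders):

```python
from pyspark.sql import SparkSession

# A local session for illustration; on a real cluster the master URL
# would point at the cluster manager (e.g. YARN).
spark = SparkSession.builder.master("local[*]").appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

# Distribute a small collection across the available workers as an RDD.
numbers = sc.parallelize(range(10))

# Transformations such as map() return new, immutable RDDs; the original
# RDD is never modified, which is what enables lineage-based fault tolerance.
squares = numbers.map(lambda x: x * x)

# Actions such as collect() trigger the actual distributed computation.
print(squares.collect())  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

spark.stop()
```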
While these two tools may compete on certain tasks, they can also complement each other, offering compelling combined capabilities. This book guides you step by step through running Spark on a Hadoop cluster, taking advantage of the synergies of the two frameworks working together.
With this book, you will learn and develop technical skills such as:
- Installing and configuring a Hadoop cluster
- Installing and configuring Apache Spark
- Running Apache Spark on a YARN cluster (a minimal sketch follows this list)
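As a taste of that last skill, here is a minimal sketch of a PySpark job that estimates pi and could be submitted to a YARN cluster (the script name and sample count are assumptions for illustration):

```python
# pi_estimate.py -- a hypothetical script; on a configured cluster it could
# be launched with:  spark-submit --master yarn pi_estimate.py
import random
from pyspark.sql import SparkSession

# spark-submit normally supplies the master URL; master("yarn") is set here
# only to make the intent explicit.
spark = SparkSession.builder.master("yarn").appName("pi-estimate").getOrCreate()
sc = spark.sparkContext

N = 1_000_000  # number of random samples (an illustrative choice)

def inside(_):
    # Sample a point in the unit square and test whether it falls
    # inside the quarter circle of radius 1.
    x, y = random.random(), random.random()
    return x * x + y * y <= 1.0

# YARN schedules the executors that evaluate this distributed count.
hits = sc.parallelize(range(N)).filter(inside).count()
print(f"Pi is roughly {4.0 * hits / N}")

spark.stop()
```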