Hadoop is a fascinating project that has attracted a great deal of interest and contributions from various organizations and institutions. Hadoop has come a long way: from a batch processing system, it has grown into a data lake capable of low-latency, high-volume streaming analysis, with the help of various Hadoop ecosystem components, particularly YARN. This progress has been substantial and has made Hadoop a powerful system that can be designed as a storage, transformation, batch processing, analytics, or streaming and real-time processing system.
A Hadoop project as a data lake can be divided into multiple phases, such as data ingestion, data storage, data access, data processing, and data management. For each phase, there are different sub-projects: tools, utilities, or frameworks that help and accelerate the process. The Hadoop ecosystem components are tested, configurable, and proven, and building similar utilities on our own would take a huge amount of time and effort. The core of the Hadoop framework is complex to develop against and optimize. The smart way to speed up and ease the process is to utilize the various Hadoop ecosystem components, so that we can concentrate more on application flow design and integration with other systems.
With the emergence of many useful sub-projects in Hadoop and other tools within the Hadoop ecosystem, the question that arises is which tool to use when, and how to use it effectively. This book is intended to complete the jigsaw puzzle of when and how to use the various ecosystem components, and to make you well aware of the Hadoop ecosystem utilities and the cases and scenarios where they should be used.
What this book covers
Chapter 1, Introduction to Big Data and Hadoop, provides an overview of big data and Hadoop, along with different use case patterns and the advantages and features of Hadoop.
Chapter 2, Hadoop Ecosystem, explores the different phases or layers of Hadoop project development and some components that can be used in each layer.
Chapter 3, Pillars of Hadoop – HDFS, MapReduce, and YARN, is about the three key basic components of Hadoop, which are HDFS, MapReduce, and YARN.
Chapter 4, Data Access Components – Hive and Pig, covers the data access components Hive and Pig, which are abstraction layers on top of the MapReduce framework, providing an SQL-like language and the Pig Latin procedural language, respectively.
Chapter 5, Storage Components – HBase, covers the NoSQL database HBase in detail.
Chapter 6, Data Ingestion in Hadoop – Sqoop and Flume, covers the data ingestion tools Sqoop and Flume.
Chapter 7, Streaming and Real-time Analysis – Storm and Spark, is about the streaming