Description
Who Should Read This Book
Hadoop Application Architectures was written for software developers, architects, and project leads who need to understand how to use Apache Hadoop and tools in the Hadoop ecosystem to build end-to-end data management solutions or integrate Hadoop into existing data management architectures. Our intent is not to provide deep dives into specific technologies—for example, MapReduce—as other references do. Instead, our intent is to provide you with an understanding of how components in the Hadoop ecosystem are effectively integrated to implement a complete data pipeline, starting from source data all the way to data consumption, as well as how Hadoop can be integrated into existing data management systems.
We assume you have some knowledge of Hadoop and related tools such as Flume, Sqoop, HBase, Pig, and Hive, but we’ll refer to appropriate references for those who need a refresher. We also assume you have experience programming with Java, as well as experience with SQL and traditional data management systems, such as relational database management systems.
So if you’re a technologist who has spent some time with Hadoop and is now looking for best practices and examples for architecting and implementing complete solutions with it, then this book is meant for you. Even if you’re a Hadoop expert, we think the guidance and best practices in this book, based on our years of experience working with Hadoop, will provide value.
This book can also be used by managers who want to understand which technologies will be relevant to their organization based on its goals and projects, in order to select appropriate training for developers.
Why We Wrote This Book
We have all spent years implementing solutions with Hadoop, both as users and in support of customers. In that time, the Hadoop market has matured rapidly, along with the number of resources available for understanding Hadoop. There is now a wealth of useful books, websites, and classes on Hadoop and the tools in its ecosystem. However, despite all of this material, there’s still a shortage of resources for understanding how to effectively integrate these tools into complete solutions.
When we talk with users, whether they’re customers, partners, or conference attendees, we find a common theme: there’s still a gap between understanding Hadoop and being able to actually leverage it to solve problems. For example, a number of good references will help you understand Apache Flume, but how do you determine whether it’s a good fit for your use case? And once you’ve selected Flume as a solution, how do you effectively integrate it into your architecture? What best practices and considerations should you be aware of to use Flume optimally?
This book is intended to bridge this gap between understanding Hadoop and being able to actually use it to build solutions. We’ll cover core considerations for implementing solutions with Hadoop, and then provide complete, end-to-end examples of implementing some common use cases with Hadoop.
Navigating This Book
The chapters in this book are organized to follow the flow you would take when architecting a solution on Hadoop: starting with modeling data in Hadoop, then moving data into and out of Hadoop, then processing the data once it’s in Hadoop, and so on. Of course, you can always skip around as needed. Part I covers the considerations around architecting applications with Hadoop, and includes the following chapters:
- Chapter 1 covers considerations around storing and modeling data in Hadoop—for example, file formats, data organization, and metadata management.
- Chapter 2 covers moving data into and out of Hadoop. We’ll discuss considerations and patterns for data ingest and extraction, including the use of common tools such as Flume and Sqoop, as well as simple file transfers.
- Chapter 3 covers tools and patterns for accessing and processing data in Hadoop. We’ll talk about available processing frameworks such as MapReduce, Spark, Hive, and Impala, and considerations for determining which to use for particular use cases.
- Chapter 4 expands on the discussion of processing frameworks by describing the implementation of some common use cases on Hadoop. We’ll use examples in Spark and SQL to illustrate how to solve common problems such as de-duplication and working with time series data.
- Chapter 5 discusses tools for large-scale graph processing on Hadoop, such as Giraph and GraphX.
- Chapter 6 discusses tying everything together with application orchestration and scheduling tools such as Apache Oozie.
- Chapter 7 discusses near-real-time processing on Hadoop, including a relatively new class of tools intended to process streams of data, such as Apache Storm and Apache Spark Streaming.
In Part II, we cover the end-to-end implementations of some common applications with Hadoop. The purpose of these chapters is to provide concrete examples of how to use the components discussed in Part I to implement complete solutions with Hadoop:
- Chapter 8 provides an example of clickstream analysis with Hadoop. Storing and processing clickstream data is a very common use case for companies running large websites, but the same techniques apply to processing any type of machine data. We’ll discuss ingesting data through tools like Flume and Kafka, cover storing and organizing the data efficiently, and show examples of processing the data.
- Chapter 9 provides a case study of a fraud detection application on Hadoop, an increasingly common use of Hadoop. This example covers how HBase can be leveraged in a fraud detection solution, as well as the use of near-real-time processing.
- Chapter 10 provides a case study exploring another very common use case: using Hadoop to extend an existing enterprise data warehouse (EDW) environment. This includes using Hadoop as a complement to the EDW, as well as providing functionality traditionally performed by data warehouses.