Description
Doug Cutting, the creator of Hadoop, likes to call Hadoop the kernel for big data, and I would tend to agree. With its distributed storage and compute capabilities, Hadoop is fundamentally an enabling technology for working with huge datasets. Hadoop provides a bridge between structured (RDBMS) and unstructured (log files, XML, text) data and allows these datasets to be easily joined together. This has evolved from traditional use cases, such as combining OLTP and log files, to more sophisticated uses, such as using Hadoop for data warehousing (exemplified by Facebook) and the field of data science, which studies and makes new discoveries about data.
This book collects a number of intermediary and advanced Hadoop examples and presents them in a problem/solution format. Each technique addresses a specific task you’ll face, like using Flume to move log files into Hadoop or using Mahout for predictive analysis. Each problem is explored step by step, and as you work through them, you’ll find yourself growing more comfortable with Hadoop and at home in the world of big data.
This hands-on book targets users who have some practical experience with Hadoop and understand the basic concepts of MapReduce and HDFS. Manning’s Hadoop in Action by Chuck Lam contains the necessary prerequisites to understand and apply the techniques covered in this book.
Many techniques in this book are Java-based, which means readers are expected to possess an intermediate-level knowledge of Java. An excellent text for all levels of Java users is Effective Java, Second Edition by Joshua Bloch (Addison-Wesley, 2008).
Roadmap
This book has 10 chapters divided into four parts.
Part 1 contains two chapters that form the introduction to this book. They review Hadoop basics and look at how to get Hadoop up and running on a single host. YARN, which is new in Hadoop version 2, is also examined, and some operational tips are provided for performing basic functions in YARN.
Part 2, “Data logistics,” consists of three chapters that cover the techniques and tools required to deal with data fundamentals, how to work with various data formats, how to organize and optimize your data, and getting data into and out of Hadoop. Picking the right format for your data and determining how to organize data in HDFS are the first items you’ll need to address when working with Hadoop, and they’re covered in chapters 3 and 4 respectively. Getting data into Hadoop is one of the bigger hurdles commonly encountered when working with Hadoop, and chapter 5 is dedicated to looking at a variety of tools that work with common enterprise data sources.
Part 3 is called “Big data patterns,” and it looks at techniques to help you work effectively with large volumes of data. Chapter 6 covers how to represent data such as graphs for use with MapReduce, and it looks at several algorithms that operate on graph data. Chapter 7 looks at more advanced data structures and algorithms such as graph processing and using HyperLogLog for working with large datasets. Chapter 8 looks at how to tune, debug, and test MapReduce performance issues, and it also covers a number of techniques to help make your jobs run faster.
Part 4 is titled “Beyond MapReduce,” and it examines a number of technologies that make it easier to work with Hadoop. Chapter 9 covers the most prevalent and promising SQL technologies for data processing on Hadoop, and Hive, Impala, and Spark SQL are examined. The final chapter looks at how to write your own YARN application, and it provides some insights into some of the more advanced features you can use in your applications.
The appendix covers instructions for the source code that accompanies this book, as well as installation instructions for Hadoop and all the other related technologies covered in the book.
Contents
Copyright
Brief Table of Contents
Table of Contents
Praise for the First Edition of Hadoop in Practice
Preface
Acknowledgments
About this Book
About the Cover Illustration
1. Background and fundamentals
Chapter 1. Hadoop in a heartbeat
Chapter 2. Introduction to YARN
2. Data logistics
Chapter 3. Data serialization—working with text and beyond
Chapter 4. Organizing and optimizing data in HDFS
Chapter 5. Moving data into and out of Hadoop
3. Big data patterns
Chapter 6. Applying MapReduce patterns to big data
Chapter 7. Utilizing data structures and algorithms at scale
Chapter 8. Tuning, debugging, and testing
4. Beyond MapReduce
Chapter 9. SQL on Hadoop
Chapter 10. Writing a YARN application
Installing Hadoop and friends
Index
List of Figures
List of Tables
List of Listings