The story behind Apache Hadoop is a very interesting one and it would not be told properly without giving a brief history of the world wide web and search engines that stemmed up around that time.
In the late 1990s and early 2000s, as the World Wide Web increased in popularity, search engines and indexes were developed to aid in the discovery of important information among the text-based content. Humans used to return search results in the beginning. However, when the web grew from a few hundred to millions of sites, automation became necessary. Crawlers for the web were developed, many as university-led research projects and search engine start-ups exploded (Yahoo, Bing, etc.).
Nutch, an open-source online search engine created by Doug Cutting and Mike Cafarella, was one such effort. They aimed to speed up the delivery of web search results by distributing data and calculations among numerous computers so that multiple activities may be completed at the same time. Another search engine project, Google, was in the works at the time. It was founded on the same idea: storing and processing data in a distributed, automated manner in order to produce relevant web search results faster.
Cutting joined Yahoo in 2006, bringing the Nutch project and concepts based on Google’s early work with distributed data storage and processing with him. The Nutch project was split into two parts: the web crawler piece remained Nutch, while the distributed computing and networking portions were renamed Nutch and Hadoop (named after Cutting’s son’s toy elephant) became the processing part. Hadoop was launched as an open-source project by Yahoo in 2008. The Apache Software Foundation (ASF), a global community of software developers and contributors, now manages and maintains Hadoop’s foundation and ecosystem of technologies.
HOW HADOOP WORKS
Hadoop makes it easy to make use of all of a cluster server’s storage and processing capability, as well as to run distributed operations on massive volumes of data. Hadoop provides the foundation for the development of other services and applications.
By connecting to the NameNode via an API call, applications that collect data in multiple formats can place data into the Hadoop cluster. The NameNode, which is duplicated among DataNodes, keeps track of the file directory structure and placement of \\\”chunks\\\” for each file. Provide a MapReduce job made up of several maps and reduce jobs that run on the data in HDFS scattered across the DataNodes to run a job to query the data.
The Hadoop ecosystem has grown significantly over the years due to its extensibility. Today, the Hadoop ecosystem includes many tools and applications to help collect, store, process, analyze, and manage big data. Some of the most popular applications are:
Spark – An open-source, distributed processing system commonly used for big data workloads. Apache Spark uses in-memory caching and optimized execution for fast performance, and it supports general batch processing, streaming analytics, machine learning, graph databases, and ad hoc queries.
Presto – An open-source, distributed SQL query engine optimized for low-latency, ad-hoc analysis of data. It supports the ANSI SQL standard, including complex queries, aggregations, joins, and window functions. Presto can process data from multiple data sources including the Hadoop Distributed File System (HDFS) and Amazon S3.
Hive – Allows users to leverage Hadoop MapReduce using a SQL interface, enabling analytics at a massive scale, in addition to distributed and fault-tolerant data warehousing.
HBase – An open-source, non-relational, versioned database that runs on top of Amazon S3 (using EMRFS) or the Hadoop Distributed File System (HDFS). HBase is a massively scalable, distributed big data store built for random, strictly consistent, real-time access for tables with billions of rows and millions of columns.
Zeppelin – An interactive notebook that enables interactive data exploration.
WHY IS HADOOP IMPORTANT
Ability to store and process huge amounts of any kind of data, quickly:
With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that’s a key consideration.
Hadoop’s distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.
Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.
Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.
The open-source framework is free and uses commodity hardware to store large quantities of data.
You can easily grow your system to handle more data simply by adding nodes. Little administration is required.
Hadoop will continue to play a very vital role in the life of a Data Scientist. Have you had any experience using Apache Hadoop?
Let us know in the comment section!