The grid computing is an architectural technique that distribute a processing job between a cluster of machines. The processing power gained combined with a very large storage space makes the foundation of CLOUD COMPUTING. By definition, the cloud computing is the association between the GRID architecture and the flexible business model based on billing the consumed resources. So what is Hadoop in all this ?

Hadoop is a set of tools (you can call it a framework, a bit diminutive) maintained by the Apache foundation used to set a distributed processing and storage environment.The project was built based on Google’s papers describing a distributed processing and storage file system, Yahoo also was one of the big pushers in this project. Why not other solution, why Hadoop ? Hadoop is well suited and built for the only purpose to process large data (TB) efficiently. To give some figures, Yahoo Labs sorted 500 GB of data in 59 seconds ! and Facebook uses Hadoop too, to list some of the biggest tech companies that use it.

MapReduce is a programming model (introduced by Google Original paper) and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

HDFS (Hadoop distributed file system):
HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes), and provide high-throughput access to this information. Files are stored in a redundant fashion across multiple machines to ensure their durability to failure and high availability to very parallel applications. This module introduces the design of this distributed file system and instructions on how to operate it.
A distributed file system is designed to hold a large amount of data and provide access to this data to many clients distributed across a network. There are a number of distributed file systems that solve this problem in different ways.
NFS, the Network File System, is the most ubiquitous distributed file system. It is one of the oldest still in use. While its design is straightforward, it is also very constrained. NFS provides remote access to a single logical volume stored on a single machine. An NFS server makes a portion of its local file system visible to external clients. The clients can then mount this remote file system directly into their own Linux file system, and interact with it as though it were part of the local drive.
One of the primary advantages of this model is its transparency. Clients do not need to be particularly aware that they are working on files stored remotely. The existing standard library methods like open(), close(), fread(), etc. will work on files hosted over NFS.

But as a distributed file system, it is limited in its power. The files in an NFS volume all reside on a single machine. This means that it will only store as much information as can be stored in one machine, and does not provide any reliability guarantees if that machine goes down (e.g., by replicating the files to other servers). Finally, as all the data is stored on a single machine, all the clients must go to this machine to retrieve their data. This can overload the server if a large number of clients must be handled. Clients must also always copy the data to their local machines before they can operate on it.

Pig was initially developed at Yahoo! to allow people using Hadoop® to focus more on analyzing large data sets and spend less time having to write mapper and reducer programs. Like actual pigs, who eat almost anything, the Pig programming language is designed to handle any kind of data—hence the name!
Pig is made up of two components: the first is the language itself, which is called PigLatin (yes, people naming various Hadoop projects do tend to have a sense of humor associated with their naming conventions), and the second is a runtime environment where PigLatin programs are executed. Think of the relationship between a Java Virtual Machine (JVM) and a Java application.

Hive was built by Facebook, it is a data warehouse system for Hadoop that allows easy data aggregation, ad-hoc queries and analysis of large data sets stored in Hadoop compatible file systems. HiveQL is a SQL “like” language that can be used to interact with the data and it also allows developers to put in their own custom mappers/reducers.


Published on

Leave a comment

Leave a Reply

Your email address will not be published. Required fields are marked *


Full Stack Cloudiologist Mind

Back to Home