In this tutorial, you will first learn what Big Data is, then what HDFS is, and finally how Hadoop and its architecture work.
What is Big Data?
Data that keeps growing in size and can no longer be stored and processed on a single machine is called Big Data. In other words, it is a very large volume of data.
Where is Big Data used?
Big Data is used in social networking sites, healthcare, banking, education, etc.
For example, when you search for a product on Amazon, it shows recommendations and similar products based on your search criteria.
Because a single machine is not enough, Big Data is stored and processed on a cluster of machines. In a cluster, the machines are connected to each other over a network so that they act as a single system.
The machines are commodity hardware (CPU + RAM) stacked together in racks. These racks are installed in physical locations called data centers.
Big data pipelines-
A big data pipeline has the following steps-
1-Big data ingestion (Sqoop/Flume)
The data comes in from multiple different sources.
2-Data validation, cleanup and processing (Spark)
In this phase, we validate, clean up and process the data.
3-Data analysis (Hive)
In this phase, we analyze the data as per the business requirements.
4-Data visualization (Tableau)
We create reports that help communicate information clearly to users.
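The four pipeline steps can be sketched in plain Python to show how data flows from one stage to the next. This is only a toy illustration: the function names, the sample records and the "total price per product" analysis are assumptions made for this example, and each function stands in for the real tool (Sqoop/Flume, Spark, Hive, Tableau) rather than calling it.

```python
# Toy stand-ins for the four pipeline stages. Every function and record
# below is hypothetical; no real Sqoop, Spark, Hive or Tableau API is used.

def ingest(sources):
    """Step 1 (ingestion): merge records coming from all sources."""
    return [record for source in sources for record in source]

def validate_and_clean(records):
    """Step 2 (validation and cleanup): drop incomplete records."""
    return [r for r in records if r.get("product") and r.get("price") is not None]

def analyze(records):
    """Step 3 (analysis): compute the total price per product."""
    totals = {}
    for r in records:
        totals[r["product"]] = totals.get(r["product"], 0) + r["price"]
    return totals

def visualize(totals):
    """Step 4 (visualization): render a simple plain-text report."""
    return [f"{product}: {total}" for product, total in sorted(totals.items())]

# Hypothetical sample data arriving from two different sources.
web_orders = [{"product": "book", "price": 10}, {"product": "", "price": 5}]
store_orders = [{"product": "pen", "price": 2}, {"product": "book", "price": 15}]

report = visualize(analyze(validate_and_clean(ingest([web_orders, store_orders]))))
for line in report:
    print(line)
```

The record with the empty product name is removed in step 2, so only the valid records reach the analysis and the report.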
What is HDFS?
It stands for Hadoop Distributed File System.
HDFS is the primary data storage system used by Hadoop applications.
2-Distributed File system
When do we use a distributed file system?
When data becomes too large to fit on a single machine, it becomes necessary to break it up and distribute it across multiple machines.
3-Block size (128MB)
HDFS stores every file as one or more blocks.
The default size of a block in HDFS is 128MB.
It also replicates (creates exact copies of) those blocks to provide fault tolerance in case of failures.
The default Replication Factor is 3.
Suppose you have 1GB of data. With a block size of 128MB, HDFS creates 8 blocks.
1GB = 1024MB
1024MB / 128MB = 8 blocks
Replication Factor: 3 (creates 3 exact copies of each block)
These default values for block size and Replication Factor are provided by Hadoop.
You can change the block size and Replication Factor to suit your requirements.
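The block arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not part of Hadoop itself: the function name hdfs_block_plan and its parameters are made up for this example, and it simply assumes that a file is cut into fixed-size blocks, with a smaller last block when the size does not divide evenly.

```python
MB = 1024 * 1024
GB = 1024 * MB

def hdfs_block_plan(file_size, block_size=128 * MB, replication=3):
    """Return (number of blocks, total block copies) for one file.

    Hypothetical helper for illustration only. The defaults mirror the
    values quoted in the text: 128MB blocks and a Replication Factor of 3.
    """
    # Round up: a final partial block still counts as one block.
    blocks = (file_size + block_size - 1) // block_size
    return blocks, blocks * replication

# 1GB with the defaults: 8 blocks, each stored 3 times.
print(hdfs_block_plan(1 * GB))

# The same file with a changed block size and Replication Factor.
print(hdfs_block_plan(1 * GB, block_size=256 * MB, replication=2))
```

With the defaults, the 1GB file gives 8 blocks and 24 stored block copies in total. In a real cluster these two values are usually set through the dfs.blocksize and dfs.replication properties in hdfs-site.xml.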
Architecture in Hadoop-
Hadoop uses a master-slave architecture.
1-Name node– stores metadata.
It knows which block of a file is stored on which machine.
The name node is responsible for tracking how each file is divided into blocks and for storing all the metadata.
This node stores all data-related information.
A cluster has one name node that acts as the master node and several data nodes that act as slave nodes.
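To make the role of the name node more concrete, here is a small Python model of the metadata it keeps. It is a simplified sketch under stated assumptions: the class ToyNameNode, the data node names dn1 to dn4, the file name sales.csv and the round-robin placement are all invented for this example. Real HDFS uses a rack-aware placement policy and a far richer metadata structure.

```python
class ToyNameNode:
    """Hypothetical master node: keeps only metadata, never file contents."""

    def __init__(self, data_nodes, replication=3):
        self.data_nodes = list(data_nodes)
        # Never ask for more copies than there are data nodes.
        self.replication = min(replication, len(self.data_nodes))
        # Metadata: (file name, block index) -> data nodes holding that block.
        self.block_map = {}

    def add_file(self, name, num_blocks):
        """Record which data nodes hold each block of the file."""
        n = len(self.data_nodes)
        for i in range(num_blocks):
            # Simple round-robin placement, chosen only to keep the sketch short.
            replicas = [self.data_nodes[(i + k) % n] for k in range(self.replication)]
            self.block_map[(name, i)] = replicas

    def locate(self, name, index):
        """Answer the question: which machines store this block?"""
        return self.block_map[(name, index)]

# One name node (master) and four data nodes (slaves).
name_node = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
name_node.add_file("sales.csv", num_blocks=8)

print(name_node.locate("sales.csv", 0))
print(name_node.locate("sales.csv", 3))
```

The name node in this model never touches the file data. It only answers where each block lives, which is the kind of metadata lookup described above.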