Introduction to Apache HBase Architecture

Cloudaeon

Oct 23, 2023 · 4 min read

Apache Hadoop has become popular for storing, managing, and processing large amounts of data because it can handle large volumes of structured, semi-structured, and unstructured data. However, HDFS, Hadoop's file system, does not support fast random reads and writes, and a file cannot be modified in place without rewriting it. HBase is a column-oriented NoSQL database built on Hadoop that overcomes these shortcomings of HDFS by enabling fast random reads and writes. Relational databases, for their part, struggle to keep up performance as data grows exponentially and becomes more varied. The HBase architecture offers scalability and partitioning for efficient storage and retrieval.

HBase is a data store that provides quick random access to huge amounts of data. It is a column-family-oriented NoSQL (Not only SQL) database built on top of the Hadoop Distributed File System (HDFS), and it suits high-throughput reads and writes over large data volumes with low I/O latency. HBase is chosen for performance and scalability: it absorbs growing load and performance demands by adding more server nodes. It performs well when consistency is very important, and it fits alongside modern SQL systems and other distributed systems.

Let's compare what SQL and NoSQL databases offer:

SQL: Rigid schema, consistency, transactions (scans every row)

NoSQL: Speed, flexibility, scaling (goes directly to the column), useful for bulk data

Types of NoSQL databases:

Column Oriented
Document Oriented
Key-Value Store
Graph Oriented

HBase runs on top of Hadoop/HDFS, so HDFS's features also apply to HBase.

Features:

Fault tolerance
Replication
Random real-time read and write access
High availability
Fast processing
Access through the Java API, a Thrift server, or REST

HBase is suited to large data volumes (terabytes to petabytes) and to workloads that don't need RDBMS features such as transactions, complex queries, or complex joins.

Companies such as Facebook, Adobe, Twitter, and Yahoo have used HBase.

Data in HBase is grouped into column families, and the cluster follows a master-slave architecture:

Master (HMaster)

Slave (Region Server)
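
As a rough mental model (a sketch, not HBase's actual storage format), a table can be pictured as nested maps: row key, then column family, then column qualifier, then value. The row keys, family names, and qualifiers below are invented for illustration.

```python
# Toy model of HBase's logical layout:
# row key -> column family -> column qualifier -> value.
# All names ("info", "stats", "user#1001", ...) are hypothetical.
table = {
    "user#1001": {
        "info": {"name": "Asha", "city": "Pune"},
        "stats": {"logins": 42},
    },
    "user#1002": {
        "info": {"name": "Ben"},  # rows need not share the same qualifiers
        "stats": {"logins": 7},
    },
}


def get_cell(table, row_key, family, qualifier):
    """Return one cell, or None if the row, family, or qualifier is absent."""
    return table.get(row_key, {}).get(family, {}).get(qualifier)


def read_family(table, family):
    """Read a single column family across all rows, skipping other families."""
    return {row: cols[family] for row, cols in table.items() if family in cols}
```

Grouping columns into families is what lets a read touch only the family it needs, as `read_family` does here.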

The sequence of the process is as follows:

Data in an HBase table is divided into regions.

Each region has a configurable maximum size (hbase.hregion.max.filesize). The default was 256 MB in early HBase releases; recent releases default to 10 GB.

When a region grows past this limit, it is split, and new data for that row-key range lands in the resulting regions.

The region size is configurable, but it is usually best to stay close to the default, because changing it for large files affects performance.
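
How a table grows from one region into several can be sketched in a few lines. This is a toy model under stated assumptions: `MAX_ROWS` stands in for the byte-size limit, the split happens at the middle row key, and the `Table` class is hypothetical rather than part of any HBase API.

```python
import bisect

MAX_ROWS = 4  # hypothetical threshold; real HBase measures store file size in bytes


class Table:
    def __init__(self):
        self.regions = [[]]  # each region is a sorted list of row keys

    def _region_index(self, row_key):
        # A region owns keys >= its first key; the first region also owns
        # every key smaller than the second region's first key.
        starts = [region[0] for region in self.regions[1:]]
        return bisect.bisect_right(starts, row_key)

    def put(self, row_key):
        i = self._region_index(row_key)
        region = self.regions[i]
        if row_key not in region:
            bisect.insort(region, row_key)
        if len(region) > MAX_ROWS:
            # Split the oversized region in two at its middle key.
            mid = len(region) // 2
            self.regions[i:i + 1] = [region[:mid], region[mid:]]
```

Because every region covers a contiguous range of row keys, locating the region for a key is a binary search over the region start keys.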

Each column family in a region relies on:

MemStore
BlockCache (one per Region Server, shared by its regions)
HFile

Write Operation:

WAL: Write-Ahead Log

When data is written to HBase, it is recorded in the HLog, i.e. the Write-Ahead Log, and in the MemStore.

The Write-Ahead Log is a file kept by each Region Server. If the Region Server later loses data that was still in memory, that data can be recovered from the Write-Ahead Log.

The MemStore is also called the write buffer: data is held in the MemStore before it is written to disk.

When the MemStore fills up, its data is flushed and a new HFile is created.
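
The write path above can be sketched as a small in-memory simulation. It is a simplification: the `Store` class and `FLUSH_THRESHOLD` are hypothetical, the threshold counts cells rather than bytes, and the "HFile" is just a sorted list.

```python
FLUSH_THRESHOLD = 3  # hypothetical; real HBase flushes by MemStore size in bytes


class Store:
    def __init__(self):
        self.wal = []          # append-only log of edits, standing in for the HLog
        self.memstore = {}     # in-memory write buffer
        self.hfiles = []       # each flush produces one immutable, sorted file
        self.flushed_upto = 0  # number of WAL entries already persisted in HFiles

    def put(self, row_key, value):
        self.wal.append((row_key, value))  # 1. log the edit first
        self.memstore[row_key] = value     # 2. then buffer it in memory
        if len(self.memstore) >= FLUSH_THRESHOLD:
            self.flush()

    def flush(self):
        if self.memstore:
            self.hfiles.append(sorted(self.memstore.items()))
            self.memstore = {}
            self.flushed_upto = len(self.wal)

    def recover(self):
        """Rebuild a lost MemStore by replaying WAL entries written after the last flush."""
        self.memstore = dict(self.wal[self.flushed_upto:])
```

Logging before buffering is the point of the design: anything that exists only in memory can be rebuilt from the log after a crash.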

One table can contain multiple regions.

One Region Server can host multiple regions.

A region contains multiple column families, each of which relies on two in-memory structures:

Read (BlockCache)

The BlockCache holds frequently read data so that later requests for it are served quickly. Because it lives in memory (RAM) and space is limited, the least recently used data is evicted from the BlockCache.

Write (MemStore/Write Buffer)

As in the write operation above, data is held in the MemStore before it reaches disk, and a full MemStore is flushed into a new HFile.
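
The least-recently-used eviction described for the BlockCache can be illustrated with a small cache. This is only a sketch: the `BlockCache` class here is hypothetical, its capacity counts blocks rather than bytes, and HBase's own cache implementation is more elaborate.

```python
from collections import OrderedDict


class BlockCache:
    """Tiny LRU cache; capacity counts blocks, not bytes (a simplification)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # ordered from least to most recently used

    def get(self, block_id, load_from_disk):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as most recently used
            return self.blocks[block_id]
        data = load_from_disk(block_id)        # cache miss: read the block from disk
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict the least recently used block
        return data
```

A repeated read of the same block is served from memory, while a block that has not been touched for a while is the first to be dropped when space runs out.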

HMaster

In HBase, the HMaster manages multiple Region Servers.

Table-level create, delete, and update operations are performed through the HMaster.

The HMaster assigns regions to Region Servers.

The HMaster also handles recovery, load balancing, and reassignment of regions.

When a Region Server fails, the HMaster coordinates its recovery.
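
Region assignment and reassignment after a failure can be sketched as follows. The "least-loaded server" policy is an assumption chosen for simplicity, and the `Master` class is hypothetical; HBase's real balancer weighs more factors.

```python
class Master:
    def __init__(self, servers):
        # Region Server name -> list of regions it currently hosts.
        self.assignments = {server: [] for server in servers}

    def _least_loaded(self):
        # Ties go to the server that was registered first.
        return min(self.assignments, key=lambda s: len(self.assignments[s]))

    def assign(self, region):
        self.assignments[self._least_loaded()].append(region)

    def handle_failure(self, dead_server):
        """Remove a failed server and hand its regions to the survivors."""
        orphaned = self.assignments.pop(dead_server)
        for region in orphaned:
            self.assign(region)
```

After `handle_failure`, every region is still hosted somewhere, which is the property the master is responsible for.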

ZooKeeper

The HMaster and all Region Servers send heartbeat signals to ZooKeeper to show that they are alive. If a Region Server crashes, it stops sending heartbeats, and ZooKeeper detects that the server has failed.

An HBase cluster can run two kinds of HMaster:

Active HMaster (sends a heartbeat to ZooKeeper)
Inactive HMaster

If the active HMaster fails, ZooKeeper lets the inactive one take over as the active HMaster.
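
Heartbeat-based failure detection and master failover can be sketched like this. It is a toy model: the `Coordinator` class, the timeout value, and the node names are hypothetical, and time is passed in explicitly so the behaviour is deterministic. ZooKeeper itself works with sessions and ephemeral nodes rather than this exact bookkeeping.

```python
class Coordinator:
    """Tracks the last heartbeat per node; a node counts as dead once it
    has been silent for more than `timeout` time units."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def alive(self, now):
        return [n for n, t in self.last_seen.items() if now - t <= self.timeout]


def elect_active_master(coordinator, masters, now):
    """Return the first live master in priority order, or None if none is alive."""
    live = set(coordinator.alive(now))
    return next((m for m in masters if m in live), None)
```

If the first master stops sending heartbeats, the next call to `elect_active_master` returns the standby instead.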

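ZooKeeper also keeps track of the server that hosts the root metadata.

To route read and write operations, HBase uses two catalog tables:

Root table (only one in the whole cluster)
Meta table (can be more than one)

ZooKeeper tracks where these tables are located.

Both tables are stored on Region Servers. They record which regions, and therefore which data, are stored on which Region Server.

This two-table layout describes older HBase releases; from HBase 0.96 onward the root table was removed, and ZooKeeper stores the location of the single hbase:meta table directly.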

When a client needs to read data, it first asks these tables where the data lives, and the table returns the location of the region. The MemStore, BlockCache, and HFiles of that region are then read, and if the data is found, HBase returns it to the user.
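
The read path can be sketched in two steps: locate the region, then search its stores. Everything named here is an assumption for illustration, including the two-region meta table, the server names, and the lookup order (MemStore first, then BlockCache, then HFiles from newest to oldest), which simplifies how HBase actually merges these sources.

```python
import bisect

# Hypothetical meta table: sorted region start keys and the server hosting each region.
META_START_KEYS = ["", "m"]   # region 0 holds keys below "m", region 1 holds the rest
META_SERVERS = ["rs1", "rs2"]


def locate_region(row_key):
    """Find the region whose start key is the largest one <= row_key."""
    i = bisect.bisect_right(META_START_KEYS, row_key) - 1
    return i, META_SERVERS[i]


def read(row_key, memstore, block_cache, hfiles):
    """Check the MemStore, then the BlockCache, then HFiles from newest to oldest."""
    if row_key in memstore:
        return memstore[row_key]
    if row_key in block_cache:
        return block_cache[row_key]
    for hfile in reversed(hfiles):
        if row_key in hfile:
            block_cache[row_key] = hfile[row_key]  # keep it in memory for later reads
            return hfile[row_key]
    return None
```

Searching newer files before older ones means the most recent version of a row is the one returned.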

Compactions

When data is written to HBase, it ends up in HFiles, which can be very small (a few KB).

HBase is designed to make updates and deletes easy. If a file is very large, it is hard to locate a record in it and operate on it, so small files help: once we know which file holds our records, finding the data is easy.

But when the data is very large, in the terabyte range, HBase creates a huge number of small files that are difficult to manage. That is why HBase introduces the concept of compactions.

There are two types of compactions:

Major compaction:
Suppose a column family in a region has 4 HFiles. A major compaction combines all 4 HFiles into 1 HFile. Because this is I/O-intensive, the administrator usually runs it during off-peak hours.

(Combines all HFiles of a column family into one HFile)

Minor compaction:
Suppose a column family in a region has 4 HFiles. A minor compaction combines them two at a time, producing 2 HFiles.

(The framework performs minor compactions automatically; we don't need to do anything, though we can configure the criteria that trigger them)
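
The 4-HFile example above can be sketched directly. This is a simplified model: HFiles are plain dictionaries listed oldest first, the fixed `group_size` of 2 mirrors the example rather than HBase's real file-selection rules, and the function names are hypothetical.

```python
def merge_hfiles(hfiles):
    """Merge HFiles (oldest first) into one sorted file; newer values win."""
    merged = {}
    for hfile in hfiles:
        merged.update(hfile)
    return dict(sorted(merged.items()))


def minor_compaction(hfiles, group_size=2):
    """Merge adjacent groups of files, leaving fewer, larger files."""
    return [merge_hfiles(hfiles[i:i + group_size])
            for i in range(0, len(hfiles), group_size)]


def major_compaction(hfiles):
    """Merge every file of the column family into a single file."""
    return [merge_hfiles(hfiles)]
```

With four input files, `minor_compaction` returns two files and `major_compaction` returns one, matching the two cases described in the text.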

Conclusion: HBase is a distributed, column-oriented NoSQL database. Compared to Hadoop or Hive, HBase performs better when retrieving a small number of records. In this article, we looked at the HBase architecture and its important components.