Regions in HBase Architecture

October 23, 2023

HBase is a data store that provides fast random access to huge amounts of data. It is a column-family-oriented NoSQL (Not only SQL) database built on top of the Hadoop Distributed File System (HDFS), and it is suited to high-throughput reads and writes over large data volumes with low I/O latency. Data in HBase is grouped by column family, and the system follows a master-slave architecture in which the HMaster is the master and the Region Servers are the slaves.

The following points summarize what regions are and what they do:

The data of an HBase table is divided into regions.
Regions are simply portions of a table that are split up and spread across the Region Servers.
HBase tables are divided horizontally by row key range into “regions.” A region contains all rows in the table between the region’s start key (inclusive) and end key (exclusive).
Regions are assigned to the nodes in the cluster, called “Region Servers,” and these serve data for reads and writes.
The Region Servers that host regions:
Communicate with the client and handle data-related operations.
Handle read and write requests for all the regions under them.
Decide the size of each region by following the region size thresholds.
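
To make the row-key-range idea concrete, here is a minimal sketch of how a row key could be mapped to the region that holds it. The start keys, the server names, and the locate_region helper are all hypothetical and chosen for illustration; this is not HBase's actual lookup code, only the same idea expressed with a binary search over sorted start keys.

```python
from bisect import bisect_right
from typing import Tuple

# Hypothetical layout: each region is identified by its start key and covers
# [start_key, next_region_start_key). The empty string marks the first
# region, which has no lower bound.
REGION_START_KEYS = ["", "g", "n", "t"]          # must stay sorted
REGION_SERVERS = ["rs1", "rs2", "rs1", "rs3"]    # illustrative host names


def locate_region(row_key: str) -> Tuple[int, str]:
    """Return (region index, region server) for the given row key."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before it is the one whose range contains the key.
    idx = bisect_right(REGION_START_KEYS, row_key) - 1
    return idx, REGION_SERVERS[idx]
```

With this layout, a key such as "apple" falls into the first region, and a key equal to a start key (for example "g") belongs to the region that begins there, because start keys are inclusive.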

Each column family in a region has a store, which contains a MemStore and HFiles.

The MemStore is also called the write buffer: data is held in the MemStore before it is written to disk. The MemStore works like an in-memory cache, and everything written to HBase lands here first. When the MemStore fills up, its contents are flushed to disk as a new HFile (stored as blocks) and the MemStore is cleared. Each flush creates one HFile.
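
The flush behaviour described above can be sketched with a toy model. The Store class below and its flush_threshold parameter are assumptions made for illustration: the threshold counts cells rather than bytes, and the "HFiles" are plain sorted lists held in memory rather than files on HDFS.

```python
class Store:
    """Toy model of one store: an in-memory MemStore plus flushed HFiles."""

    def __init__(self, flush_threshold=3):
        # flush_threshold counts cells here; it is an illustrative stand-in
        # for a size limit in bytes.
        self.flush_threshold = flush_threshold
        self.memstore = {}   # row key -> value, the write buffer
        self.hfiles = []     # each "HFile" is a sorted list of (key, value)

    def put(self, row_key, value):
        """Write into the MemStore first; flush once the buffer is full."""
        self.memstore[row_key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the buffered cells out as one new sorted HFile."""
        if not self.memstore:
            return
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}
```

Writing four rows with a threshold of three leaves one HFile holding the first three rows in sorted order, with the fourth row still waiting in the MemStore.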

In the HBase architecture, a region consists of all the rows between the start key and the end key assigned to that region. Regions are assigned to nodes in the HBase cluster called “Region Servers,” which serve the data for reads and writes. In terms of numbers, a single Region Server can serve approximately 1,000 regions. Within each region, rows are kept in sorted order by row key.
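
Keeping rows sorted is what makes range scans cheap. The sketch below illustrates this with a hypothetical Region class (not an HBase API): it rejects keys outside its range and answers a scan with two binary searches over the sorted keys.

```python
from bisect import bisect_left, insort


class Region:
    """Toy region holding rows in [start_key, end_key), sorted by row key."""

    def __init__(self, start_key, end_key):
        self.start_key = start_key   # inclusive
        self.end_key = end_key       # exclusive
        self.keys = []               # row keys, always kept sorted
        self.values = {}             # row key -> value

    def contains(self, row_key):
        return self.start_key <= row_key < self.end_key

    def put(self, row_key, value):
        if not self.contains(row_key):
            raise KeyError(f"{row_key!r} is outside this region's key range")
        if row_key not in self.values:
            insort(self.keys, row_key)   # insert while preserving sort order
        self.values[row_key] = value

    def scan(self, start, stop):
        """Return rows with start <= key < stop, in sorted order."""
        lo = bisect_left(self.keys, start)
        hi = bisect_left(self.keys, stop)
        return [(k, self.values[k]) for k in self.keys[lo:hi]]
```

Because the keys are sorted, a scan only touches the rows inside the requested range instead of every row in the region.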

A Region Server is responsible for handling, managing, and executing read and write operations on the set of regions it hosts. The default maximum size of a region was 256 MB in early HBase releases (later releases raised it to 10 GB), and it can be configured as required through the hbase.hregion.max.filesize property. Region Servers are the worker nodes that handle read, write, update, and delete requests from clients. A Region Server process runs on every worker node in the Hadoop cluster, alongside the HDFS DataNode, and consists of the following components:

Block Cache – This is the read cache. The most frequently read data is stored in the read cache, and whenever the block cache is full, the least recently used data is evicted.

MemStore – This is the write cache, which stores new data that has not yet been written to disk. Every column family in a region has its own MemStore.

Write-Ahead Log (WAL) – This is a file that stores new data that has not yet been persisted to permanent storage. It is used for recovery in case of failure.

HFile – This is the actual storage file that stores the rows as sorted key-values on disk.
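
How the four components cooperate can be sketched as follows. This is a deliberately simplified model under several assumptions: the class and method names are hypothetical, the cache holds whole rows rather than blocks, the WAL is an in-memory list that is simply truncated on flush, and the read path stops at the first hit instead of merging cells across sources.

```python
class RegionServerSketch:
    """Toy write and read path across WAL, MemStore, BlockCache and HFiles."""

    def __init__(self):
        self.wal = []           # append-only log of (row_key, value)
        self.memstore = {}      # write cache
        self.block_cache = {}   # read cache (per row here, per block in HBase)
        self.hfiles = []        # flushed files as dicts, oldest first

    def put(self, row_key, value):
        self.wal.append((row_key, value))     # 1. log the edit first
        self.memstore[row_key] = value        # 2. then buffer it in memory
        self.block_cache.pop(row_key, None)   # drop any stale cached copy

    def flush(self):
        """Persist the MemStore as a new HFile; flushed edits leave the WAL."""
        if self.memstore:
            self.hfiles.append(dict(sorted(self.memstore.items())))
            self.memstore = {}
            self.wal = []

    def recover(self):
        """Rebuild the MemStore after a simulated crash by replaying the WAL."""
        self.memstore = {}
        for row_key, value in self.wal:
            self.memstore[row_key] = value

    def get(self, row_key):
        """Return (value, source), checking memory before the HFiles."""
        if row_key in self.memstore:
            return self.memstore[row_key], "memstore"
        if row_key in self.block_cache:
            return self.block_cache[row_key], "block_cache"
        for hfile in reversed(self.hfiles):   # newest file first
            if row_key in hfile:
                self.block_cache[row_key] = hfile[row_key]
                return hfile[row_key], "hfile"
        return None, "miss"
```

In this model, the first read of a flushed row comes from an HFile and populates the cache, so a repeated read is served from memory. Losing the MemStore before a flush is recoverable because every edit was logged to the WAL first.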

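Each column family of a region works with the following (the MemStore and HFiles belong to the column family, while the BlockCache is shared across the Region Server):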

Memstore
BlockCache
HFile

A region contains multiple column families, each of which relies on two kinds of memory:

Read (BlockCache)

The BlockCache holds data that is read frequently, so that later requests for the same data can be served quickly. Because it lives in memory (RAM) and space is limited, the least recently used data is evicted from the BlockCache when it fills up.
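
Least-recently-used eviction can be illustrated in a few lines. The ToyLruCache class below is a hypothetical stand-in, not HBase's own cache implementation, and its capacity counts blocks rather than bytes.

```python
from collections import OrderedDict


class ToyLruCache:
    """Minimal LRU cache: the least recently used block is evicted first."""

    def __init__(self, capacity):
        self.capacity = capacity        # maximum number of cached blocks
        self._blocks = OrderedDict()    # ordered oldest use -> newest use

    def get(self, block_id):
        if block_id not in self._blocks:
            return None                       # cache miss
        self._blocks.move_to_end(block_id)    # mark as most recently used
        return self._blocks[block_id]

    def put(self, block_id, data):
        self._blocks[block_id] = data
        self._blocks.move_to_end(block_id)
        if len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)  # evict least recently used

    def cached_ids(self):
        return list(self._blocks)
```

With a capacity of two, caching blocks b1 and b2, reading b1 again, and then caching b3 evicts b2, because b1 was used more recently.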

Write (MemStore/write buffer)

The MemStore is also called the write buffer: data is stored in the MemStore before it is written to disk.

When the MemStore gets full, its data is flushed and one HFile is created.

Operations on HFiles

When data is written to HBase, each flush stores it in an HFile, which can be very small (on the order of KB).

HBase is designed to make updating and deleting data easy. If a file is very large, it is difficult to locate a record in it and operate on it, so small files are helpful: once we know which file contains our records, finding the data is easy.

But if the data set is very large, for example in the terabyte range, HBase creates a great many small files that become difficult to manage. That is why the concept of compaction was introduced in HBase. There are two kinds: minor compaction merges a few small HFiles into one larger file, and major compaction merges all the HFiles of a store into a single file, also dropping deleted and expired cells.
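
The difference between the two compaction types can be sketched as a merge of sorted files. Everything here is an illustrative assumption: HFiles are modelled as sorted lists of (key, value) pairs ordered oldest first, None stands in for a delete marker, and the function names are hypothetical. Real compaction selects files by size and age policies that this sketch does not attempt to reproduce.

```python
import heapq

TOMBSTONE = None  # illustrative delete marker


def merge_hfiles(hfiles, drop_deletes):
    """Merge sorted HFiles (oldest first) into one; the newest value wins."""
    # Tag each cell with its file's age so that, for equal keys,
    # the cell from the newest file sorts first.
    tagged = (
        [(key, -age, value) for key, value in hfile]
        for age, hfile in enumerate(hfiles)
    )
    merged = []
    last_key = object()   # sentinel that equals no real key
    for key, _, value in heapq.merge(*tagged, key=lambda c: (c[0], c[1])):
        if key == last_key:
            continue      # an older version of a key already handled
        last_key = key
        if drop_deletes and value is TOMBSTONE:
            continue      # major compaction removes deleted cells
        merged.append((key, value))
    return merged


def minor_compact(hfiles, count=2):
    """Merge only the newest `count` files; delete markers are kept."""
    return hfiles[:-count] + [merge_hfiles(hfiles[-count:], drop_deletes=False)]


def major_compact(hfiles):
    """Merge every file into one and drop deleted cells."""
    return [merge_hfiles(hfiles, drop_deletes=True)]
```

The minor variant keeps delete markers because an older, uncompacted file may still contain the deleted cell. Only when every file is merged is it safe to drop both the marker and the data it hides.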
