How to Resolve Top Performance Issue in Impala?

What is Impala?

Impala is an MPP (Massive Parallel Processing) SQL query engine for processing huge volumes of data that is stored in a Hadoop cluster.

It is an open-source software which is written in C++ and Java. It provides high performance and low latency compared to other SQL engines for Hadoop.

In other words, Impala is the highest performing SQL engine (giving RDBMS-like experience) which provides the fastest way to access data that is stored in the Hadoop Distributed File System.

We have a major issue where many queries end up in the “Waiting to be closed” state in the `queries` of the Impala Daemon UI.

Generally, Impala does not close the query before the client fetches the full data and sometimes clients(Hue) leave the query open.

It does not close the query unless we close the tab so we have added the session timeout, which means when the session is not used for some time it will close all the queries and the session is closed.

Most of the times many users set this setting lower than the default. So people do not face the session time out error and can wait for queries to fetch the data.

So queries in Impala end up with the ‘waiting to closed’ state. When a query is in waiting to be closed state it is completed but still holds the resources, which causes the issue to run the new query if multiple queries go in the same state and user face impala performance issue. We can clear those queries manually by going to impala daemon UI like http://impala_daemon_hostname:25000/queries.

In the Hadoop cluster, as we have multiple nodes and Impala daemon it is time-consuming to connect to each node and then close the queries.

The below python script uses the CM API to close the queries in ‘waiting to be closed’ state. It will connect to each node and close the queries.

import urllib, json

datanodes =[“http://impala_daemon_hostname_1:25000/”,

“http://impala_daemon_hostname_2:25000/”,

“http://impala_daemon_hostname_3:25000/”,

“http://impala_daemon_hostname_4:25000/”,

“http://impala_daemon_hostname_n:25000/”]

for i, datanode in enumerate(datanodes):

print(“Checking {}: {}”.format(i, datanode)) try:

response = urllib.urlopen(datanode + “queries?json”)

data = json.loads(response.read())

#print data

if data[“num_waiting_queries”] > 0:

print(data[“num_waiting_queries”])

for in_flight_query in data[“in_flight_queries”]:

if in_flight_query[“waiting”] is True and in_flight_query[‘state’] == “FINISHED” or in_flight_query[‘state’] == “EXCEPTION”: cancel_url = datanode + “cancel_query?query_id={}”.format(in_flight_query[‘query_id’]) print(cancel_url)

response = urllib.urlopen(cancel_url)

except IOError:

print(“Skipping {}: {}”.format(i, datanode))

except Exception as e:

print(e)

Conclusion: As a result, we know about Impala & performance issues in Impala & how to solve it.


November 10, 2024

Optimizing ETL Pipelines for Databricks

Slow, inefficient ETL (Extract, Transform, Load) processes lead to delayed insights, high costs, and unreliable data pipelines. These issues are magnified when organizations fail to optimize […]