Cleaning Trash in HDFS

HDFS has a trash feature: any file you delete is first moved into trash, which acts like a recycle bin. This behavior is controlled by two properties,

fs.trash.interval and fs.trash.checkpoint.interval. A deleted file is kept in a .Trash folder under the user's home directory for the length of time set by fs.trash.interval.
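You can check the values currently in effect on your cluster with the hdfs getconf command (a quick sanity check, assuming a standard HDFS client is available on the node):

$hdfs getconf -confKey fs.trash.interval

$hdfs getconf -confKey fs.trash.checkpoint.interval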

Let us walk through how to perform the task.

These steps may be performed from any node in the cluster. Sign on and authenticate as the user whose Trash folder you wish to clear out:

$sudo su -

Let us add a sample file to HDFS:

$hdfs dfs -put test.txt /trashtest.txt

You can list the files in HDFS using the following command (since the sample file was placed at the root of HDFS, we list / rather than the user's home directory):

$hdfs dfs -ls /

Now let us remove the file:

$hdfs dfs -rm /trashtest.txt

The deleted file is moved to a folder called .Trash under the user's home directory (/user/username/.Trash).

Inside .Trash, the file is placed in a subfolder named Current.
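You can confirm the move by listing the trash folder; a relative path like .Trash/Current resolves under the user's home directory:

$hdfs dfs -ls .Trash/Current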

Moving the file from its original location to the .Trash folder is only a rename in the NameNode's namespace; the data blocks themselves are not moved or copied.

The files in the Current folder are periodically rolled up into a checkpoint, which is controlled by the property fs.trash.checkpoint.interval in the HDFS configuration.

If fs.trash.checkpoint.interval is set to 1 hour, then every hour the contents of Current are moved from the trash folder into a new timestamped checkpoint folder under .Trash.
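After at least one checkpoint has run, listing the .Trash folder should show Current alongside one or more timestamped checkpoint folders:

$hdfs dfs -ls .Trash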

The file is kept in the checkpoint location for the duration set by fs.trash.interval, which is also configured in the HDFS configuration.

The default value of fs.trash.interval in Cloudera is 1 day.

If critical files need to be deleted permanently, without keeping a copy in the trash folder, we can use the -skipTrash option:

$hdfs dfs -rm -skipTrash /path/filename
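To confirm the file bypassed the trash, you can list the trash folder and verify that the file does not appear there:

$hdfs dfs -ls .Trash/Current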

If a file is deleted accidentally and has been moved to trash, it can be recovered from the trash folder:

$hdfs dfs -mv /user/username/.Trash/Current/trashtest.txt /trashtest.txt

To empty the trash manually, without waiting for the interval to elapse, we can use -expunge, which removes checkpoints older than fs.trash.interval and checkpoints the Current folder:

$hdfs dfs -expunge
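Putting it all together, here is a minimal end-to-end sketch of the workflow above. It assumes a local file test.txt, trash enabled on the cluster, and relative paths resolving under the user's home directory:

$hdfs dfs -put test.txt /trashtest.txt                       # upload a sample file
$hdfs dfs -rm /trashtest.txt                                 # delete it; it moves to the trash
$hdfs dfs -ls .Trash/Current                                 # confirm it sits in the trash
$hdfs dfs -mv .Trash/Current/trashtest.txt /trashtest.txt    # restore it
$hdfs dfs -rm -skipTrash /trashtest.txt                      # delete it permanently, bypassing trash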
